In November 2020, DAGsHub gave a series of guest lectures to the excellent Y-DATA course for aspiring data scientists, which we would now like to share with whoever finds it useful, in blog form!
Cut to the chase!
- Shell basics
- Background on shells
- Shell variables
- Runnable files
- Package managers (e.g. brew and apt)
- Shell commands inside Jupyter notebooks
- Text editors in the terminal
- Other useful commands
- Running commands in the background
- Symbolic links
The topic - system, IT, DevOps, MLOps, whatever other name you want to call it - how do you make the computer do what you want, outside the context of Python (or R or Matlab etc., we don't discriminate)? How do you get that beautiful neural network of yours to run on an actual server in the cloud, so that it can serve actual users?
What to do when the bubble bursts, and you have to step outside the Jupyter notebook to fix things?
Of course, this is a wide open question which requires a lot of previous knowledge to answer. In our lectures, we wanted to start by building a solid foundation for the students to stand on. So, we went to the classics - what is Linux? Why do people use it? What is Bash? How to use the terminal? How to exit vim?!
Instead of creating yet-another-tutorial on how to move and copy files in terminals, we wanted to bring perspective:
- Why would you use Linux, Bash, and other system tools?
- What's the smart way to do it, based on our subjective experience?
- What common problems will you come across, and how to solve them?
- What's the mental framework for working with these tools, to gain understanding and learn more by playing?
So, this guide/cheatsheet is more about our tips and tricks, and is definitely not exhaustive. On the contrary - we wanted to make the most of students' time, and only talk about what's interesting. Other things can be learned on an as-needed basis.
Who is this for?
The curriculum and some of the tips are aimed at data scientists who want an introduction to the topics of Linux & Bash. However, the data science orientation mainly comes into play in a few domain specific tips, and in the stated motivations to learn these things - if you're an aspiring web developer, there's no reason not to benefit from this guide as well!
What is Linux?
- A family of open source operating systems.
- Developed by Linus Torvalds, who also invented Git to manage the source code for Linux.
- An operating system is a program that takes over a bit after your computer turns on.
- For the first few seconds after your computer switches on, the motherboard runs a small piece of hard-coded firmware called the BIOS (or UEFI on newer machines), but it quickly hands control over to some operating system kernel, which is installed on one of the hard drives, a USB stick or CD.
- From that point on, the kernel decides which programs to run when, and how to control physical devices (via drivers).
- An operating system is a bundle of programs that come packaged together. The kernel is the most important part, but it comes with more programs which help the users communicate with the kernel.
- e.g. File explorers are part of the OS, but not the kernel - they're just graphical interfaces which sit between the user and the kernel.
- Operating systems normally also handle file systems, user permissions, memory management, and many other things.
- The thing that unites all the different operating systems in the Linux family is they all use the same Linux kernel - other parts differ. More on that later in the section about distributions.
What is Linux good for?
An operating system is, surprisingly, just a type of system. Systems are designed by humans, and better designs lead to better performance, stability, and flexibility. Linux is simply a better designed operating system. It's super flexible and stable - "blue screens of death" are exceedingly rare in production Linux servers, and their performance is very reliable. Which is why a vast majority of production systems run on Linux, and that's also why it's good for anyone working in tech to be Linux literate. That includes you, dear reader.
Being open source leads to high quality, as bugs have fewer dark places to hide in. Developers can peer under the covers to make sure their Linux applications will work well, rather than guessing and relying on questionable documentation from closed source operating system developers.
But with great power and flexibility comes a great ability to shoot yourself in the foot. Linux makes that easy as well.
What do the different types of Linux mean?
One of the confusing things when entering the Linux world is the giant pile of jargon that gets thrown in your face. It feels like the explanations expect you to already know and understand a bunch of other terms, without building understanding step by step. So, I'd like to give you a very brief summary of terms you might come across and what they mean.
Mac and Unix are very similar to Linux, but technically they are not Linux. You will have a hard time telling the difference, unless you dive deep.
Unix is older than Linux and extremely similar - In fact, Linux is an open source re-implementation of Unix (which was closed source, but very good). This is pretty much historic trivia, as Unix is rarely seen nowadays, but know that some people use the words Unix and Linux interchangeably.
In general, there’s a name for operating systems that look and feel like Unix – POSIX compliant, or *nix. When you see these words, translate them as “follows the conventions of Linux, such as basic commands for file manipulation (ls, cd, mkdir) and "/" as the root of the file system etc.”
GNU is a large set of free software which is the foundation for much of Linux – compilers, C libraries, programs to zip files, and many others. It's also the name of an independent POSIX operating system, with more hardcore ideology around free software than Linux.
All of the above systems, as well as Linux itself, are examples of POSIX compliant or *nix systems.
Linux Distributions / Distros
There are (too?) many flavours of “real Linux”, called distros or distributions. It can be a headache to differentiate them.
A distribution is like a "company", which invents a new operating system. They wrap the Linux Kernel with a new bundle of peripheral programs - i.e. they may use a different mix of GUI programs, support different hardware by default, etc. They release new versions occasionally.
The bottom line – unless you know what you’re doing, just use Ubuntu. It’s the most user friendly, widely supported, and easy to install.
Red Hat Enterprise Linux, or RHEL, is a different distro which is used sometimes in heavy duty production servers. Fedora is the desktop equivalent of RHEL - usually, developers aiming to run their applications on RHEL servers will use Fedora for their development computers, to avoid compatibility issues.
Alpine is a super minimal distro which is used for many Docker images. Read our blog post about Docker for more information.
When people think of Linux, they usually associate it with a scary terminal (plus attached Anonymous hacker with a hoodie 👩💻).
Don't Panic – it’s not so scary! Today, it’s really easy to install Linux on a computer, with a regular GUI wizard, if you pick a distro that cares about that sort of thing (for example, Ubuntu).
We'll focus on terminals / shells in this lecture, since that is always available, and generally where "real work" is done. Production servers will rarely have GUIs. Don't let that discourage you - after you get used to it, using the shell can become much more convenient than GUIs!
The following actions are very basic file manipulation commands - moving, copying, deleting, viewing, etc.
I think there are enough sources online to learn these basic commands, and so I won't be re-explaining them here. Below the list, I provide my recommended way to learn about them, so don't worry!
The most convenient way I found to learn about these commands, even if you don't have a Linux terminal available, is to follow these tutorials:
Do up to and including lesson 3.
Webminal includes an interactive terminal in the browser, which you can play with and use for the next tutorial (which doesn't have an interactive shell, only text and quizzes).
The shell (AKA terminal) is itself a program! It's in charge of things like:
- Taking keystrokes from the user
- Displaying text output to the user.
- Remembering what directory you're in currently (changed using cd, shown using pwd)
- Turning your commands into new running programs - processes, by sending appropriate messages to the kernel
For example, what does the shell do when I type python hello_world.py and press enter?
- It's in charge of knowing where the actual program called "python" is located in the file system - probably something like /usr/bin/python. In the end, the kernel is the only thing that can run new programs, and it expects absolute paths to files.
- I can check where the shell is actually finding "python" by running which python. The which command outputs the full path found by the shell. How does it know? More on that later, in the section on runnable scripts.
- which is a useful command! Maybe you have several conflicting versions of python installed, and you're not sure which one is actually running and giving you problems. which python to the rescue!
- Or maybe I have some runnable script, and I want to edit, delete or rename it, but I forgot where it's located. which to the rescue!
- So, what actually happens is that the shell tells the kernel: "Please take the program file located at /usr/bin/python, and turn that into a new running process with a single argument /absolute/path/to/hello_world.py, running inside the current directory. Also, please send all of that process's outputs back to me (the shell process) so I can show them to the user".
The shell also does things like expand paths - e.g. ~, the current user's home folder, is something that only the shell knows about, not the kernel. When you type cd ~/projects, the shell translates that to cd /home/username/projects.
Shells also have many more features - variables, ifs, loops. Shell scripting is a full programming language.
But I would recommend not writing anything complicated in them - try to offload complexity to "real" programming languages. Shell scripts can be very convenient, but hard to read, maintain, and debug.
A useful thing to know - you can use the boolean operators, && and ||, the same way you would use them in a programming language:
python script_that_fails.py || echo "Error! :("
# Error! :(
python script_that_succeeds.py || echo "Error! :("
#
python script_that_fails.py && echo "Success! :)"
#
python script_that_succeeds.py && echo "Success! :)"
# Success! :)
The shell knows whether a command succeeds or not based on its "exit code". Any code other than zero signifies failure. Some commands return different exit codes to signal what went wrong. To see the exit code of the last command, use echo $?. I won't go into more detail about this in this blog.
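To see exit codes in action, here is a minimal sketch built on the standard true and false commands, which always succeed and always fail respectively. The variable names are made up for illustration.

```shell
# "true" always succeeds, so its exit code is 0.
true
code_after_true=$?

# "false" always fails. The right-hand side of || only runs after a
# failure, and $? there still holds the failed command's exit code.
code_after_false=0
false || code_after_false=$?

# A subshell can choose its own exit code with "exit".
custom_code=0
(exit 3) || custom_code=$?

echo "true: $code_after_true, false: $code_after_false, custom: $custom_code"
# true: 0, false: 1, custom: 3
```

Saving $? into a variable straight away matters, because every later command overwrites it with its own exit code.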
There are many different shell programs available in the world, but the most universal ones are sh (Bourne Shell) and its improved successor bash (short for Bourne Again SHell). Personally, I use zsh, as it has more advanced features that I enjoy, while being largely backwards compatible with bash. To see my recommended setup with zsh, see the end of this guide.
If you think about it, desktops and other GUIs are actually like shells that take the mouse into account and draw pixels to windows, instead of characters to a terminal!
To illustrate - when you double-click hello_world.py in a file explorer, it does something very similar to what I described above when I run python hello_world.py in the shell.
You can define and use variables like so:
# Setting a variable is this simple:
R2=D2
# Then, wherever you use $R2 in a future command, the shell will
# replace $R2 with the variable's value, in this case "D2"
echo $R2
# D2
Sometimes, it's useful to take the output of one command and use it in the body of another command. For example, let's say you want to say hello to yourself. You could do it in one of a few ways:
# Backticks - the command inside the backticks is executed,
# and its output string is inserted into the command instead.
# In this case, the whoami command outputs your username:
ME=`whoami`
echo Hello $ME!
# Hello username!

# Can also be used directly in one step
echo Hello `whoami`!
# Hello username!

# More general than backticks and works the same -
# you can use $() and even nest it
echo Hello $(echo $(whoami), and goodbye)!
# Hello username, and goodbye!
Pipes allow you to connect the output of one command as the input to another command.
Stringing these together can result in some real magic. Examples:
# Print the first 10 lines of a file
cat bigfile | head -10
# Print lines 7-10 of the file
# (head keeps lines 1-10, tail keeps the last 4 of those)
cat bigfile | head -10 | tail -4
# Comfortably read all of bigfile with a text reader called "less",
# which lets you scroll up and down and do searches
cat bigfile | less
Using pipes to create complex logic from simple parts is the core philosophy of *nix. See this short and sweet blog post for a good explanation.
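If you don't have a big file at hand, a pipeline like this can be tried with seq, which prints a range of numbers one per line. This is only a sketch; the variable names are made up for illustration.

```shell
# seq prints the numbers 1 to 20, one per line, standing in for a big file.
# head keeps lines 1-10, then tail keeps the last 4 of those: lines 7-10.
middle_lines=$(seq 1 20 | head -n 10 | tail -n 4)
echo "$middle_lines"
# 7
# 8
# 9
# 10

# Pipes chain as far as you like: count how many lines survived.
line_count=$(echo "$middle_lines" | wc -l | tr -d ' ')
echo "$line_count"
# 4
```

Each command in the chain does one small job, and the pipe does the plumbing between them.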
Every process has 3 standard streams:
- Standard input - stdin
- Standard output - stdout
- Standard error - stderr
Usually, any warnings or errors will print to stderr. This is done to keep stdout clean, so that piped commands get the actual data you want them to work on, without any garbage like deprecation warnings.
In fact, pipes connect one process's stdout to the next process's stdin.
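A small sketch can make the split between the two output streams visible. The noisy function below is a hypothetical helper, invented only for this illustration, that writes one line to each stream.

```shell
# Hypothetical helper that writes one line to each output stream.
noisy() {
  echo "real data"                 # goes to stdout
  echo "warning: ignore me" >&2    # goes to stderr
}

# Only stdout travels through the pipe, so wc counts a single line.
# The warning skips the pipe and is printed straight to the terminal.
piped_lines=$(noisy | wc -l | tr -d ' ')
echo "$piped_lines"
# 1

# 2>/dev/null throws stderr away entirely.
clean_output=$(noisy 2>/dev/null)
echo "$clean_output"
# real data
```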
There are a few more important operators: < and >, which connect these streams directly to files.
# Print lines 7-10 of the file
# This is functionally the same as the above example with cat.
# cat reads bigfile and outputs it to cat's stdout, which is connected to
# head's stdin.
# The < operator just directly connects the bigfile as head's stdin!
# (head keeps lines 1-10, tail keeps the last 4 of those)
head -10 < bigfile | tail -4

# The > operator connects echo's stdout to the file called numbers.
# It will create the file if it doesn't exist, and overwrite it if it exists.
echo 123 > numbers
cat numbers
# 123

# Same, but >> doesn't overwrite existing files, it appends to them
echo 1.618 >> numbers
cat numbers
# 123
# 1.618

# If you want to run some program and log all of its output to a file,
# you should use &> instead of > (or &>> instead of >>).
# &> redirects both stdout AND stderr to the file.
# Otherwise, you might not see warning messages and errors in the log,
# which sometimes are the most important parts!
python3 my_long_training_script.py &> log.txt
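To try logging both streams without a real training script, here is a minimal sketch. The fake_training_job function is a hypothetical stand-in, and the log goes to a temporary file created with mktemp. It uses the longer spelling "> file 2>&1", which does the same job as &> and also works in shells other than bash.

```shell
# Hypothetical stand-in for a long training script.
fake_training_job() {
  echo "epoch 1 finished"
  echo "warning: learning rate is very high" >&2
}

log_file=$(mktemp)

# First send stdout to the file, then send stderr to wherever stdout goes.
fake_training_job > "$log_file" 2>&1

cat "$log_file"
# epoch 1 finished
# warning: learning rate is very high
```

The order matters here: writing 2>&1 before the > would leave stderr pointing at the terminal instead of the file.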
You will often want to use &>> when running long processes while you're doing other things - for example, model training or data processing! This will allow you to review the logs after the process finishes or crashes, and try to debug if things go wrong.
Press ctrl+c if you want to stop the currently running command (no, it does not mean "copy" in this scenario).
# Waits 314 seconds. Who has the time?!
sleep 314
# Now click ctrl+c and regain control of your life!
ctrl+c is often written as ^C. The ^ sign often signifies "control".
*nix systems make all files, devices, disks, etc. available under one filesystem which starts at the root: /. Unlike Windows, there is no C: or D: drive.
Under the root folder, by convention you will see some folders with standard names like /var, etc. These are interesting only when trying to do more advanced things, so I'll leave them out of this guide. Here is a separate guide, if you're curious about the meaning behind this conventional folder structure.
You should only be doing stuff inside your own home folder in /home/username/ unless you know what you're doing. Inside your home folder, it's normally a good and safe idea to create some new folder to contain all of your projects and playgrounds, and confine yourself to that folder.
Paths are case sensitive, unlike on Windows! Try extremely hard to avoid creating two files or folders which have the same name in different cases. If you later try to copy those files to a Windows machine, you will suffer.
mkdir example
cd Example
# cd: no such file or directory: Example
mkdir Example
cd Example
# OK!
Hidden files and folders - their names start with a dot (.). Use ls -la to see them. I prefer to always use ls -la, so I've made a shortcut for it! alias l="ls -la" means that whenever I run l it does what I want quickly!
Avoid spaces in file or folder names!!! Use dashes or underscores for a happier life. Otherwise, you will have to do workarounds:
mkdir "folder with spaces"
cd folder with spaces
# cd: too many arguments
cd "folder with spaces"
# OK!
Typical scenario - you write or download a shell script or program, try to run it by typing the file name, and get an error to the face:
command not found: script.sh
script.sh is right there in the current working directory! 🤨😡🥺😭😞
This is because the shell uses one of 2 ways to figure out where the program you want to run is located:
- If you specify the program file with not just its name, but also the path to it (relative or absolute), then the shell understands what you want and tries to run it. So ./script.sh will work.
- If you didn't specify a path, the shell searches for script.sh inside all of the directories listed inside a special shell variable called PATH.
You can check what your PATH is by running echo $PATH. The output is a list of folders separated by : characters. The shell looks in them one-by-one until it finds an executable file called script.sh.
In this case it fails to find it, since our current directory is not in the PATH. In the case of python, it succeeds since /usr/bin/ IS in the PATH.
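The lookup described above can be poked at directly. The sketch below uses command -v, a shell built-in that answers the same question as which. The program name surely-not-a-real-program is made up, on the assumption that nothing by that name is installed.

```shell
# Show each PATH folder on its own line, in the order the shell searches them.
echo "$PATH" | tr ':' '\n'

# Print where the shell would find the sh program.
sh_location=$(command -v sh)
echo "$sh_location"

# A name that does not exist in any PATH folder is simply not found.
if command -v surely-not-a-real-program >/dev/null 2>&1; then
  lookup_result="found"
else
  lookup_result="not found"
fi
echo "$lookup_result"
```

The exact folders and locations printed depend on your machine, so no expected output is shown for them.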
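You can add a new folder to the PATH. I advise making a bin directory in your home directory, and adding it permanently to the PATH.
To add permanently to the PATH, you typically need to add a line to the .bash_profile file in your home folder. That file is a script which gets executed every time a bash login shell starts, so it becomes a permanent change. (Terminal windows on many Linux desktops read .bashrc instead, so you may need to add the line there.)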
Here is a simple way to add a new folder to the PATH:
# Make sure to use >> and not > !!!
# You don't want to delete your existing .bash_profile
echo 'export PATH=/home/username/bin:$PATH' >> ~/.bash_profile
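Before touching .bash_profile, you can try the same idea for the current shell session only. This is a sketch under stated assumptions: demo_bin is a temporary folder standing in for ~/bin, and say-hi is a made-up program name. It uses chmod +x to make the file runnable, which is explained right after this.

```shell
# Hypothetical stand-in for ~/bin, so this sketch leaves your home folder alone.
demo_bin=$(mktemp -d)

# Create a tiny program called say-hi inside it.
cat > "$demo_bin/say-hi" <<'EOF'
#!/bin/sh
echo "hi from say-hi"
EOF
chmod +x "$demo_bin/say-hi"

# Put the folder at the front of PATH, for this shell session only.
export PATH="$demo_bin:$PATH"

# The shell now finds say-hi by name alone, from any directory.
say_hi_output=$(say-hi)
echo "$say_hi_output"
# hi from say-hi
```

Because the change was never written to a startup file, it disappears as soon as you close the shell.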
So, you figured out your previous mistake, try to run ./script.sh again and get:
permission denied: ./script.sh
Use chmod to make the script.sh file executable!
Annoying, I know, but this is instead of the Windows convention of using .exe - usually programs don't have any suffix in Linux, so you have to explicitly say they're runnable and not just some random collection of bytes.
Use ls -l to see file permissions (and owners). Example output:
-rwxr-xr-x 1 root root    0 Nov 25 21:55 executable
-rw-r--r-- 1 root root   30 Nov 25 21:54 script.sh
drwxr-xr-x 2 root root 4096 Nov 25 21:55 subdir
The left-most column shows file types and permissions.
The first letter just indicates the type of file - typically d for a directory or - for a normal file.
Then comes the mysterious rwxrwxrwx. This indicates Read, Write, eXecute permissions, repeated 3 times: for the user who owns the file, for its group, and for everyone else. Generally you only care about the first 3 letters - permissions for your own user.
So, for example, -rwxr-xr-- indicates a normal file, that your own user can do anything with, other users in your group can only Read or eXecute, and everyone else can only Read.
chmod +x script.sh means "allow me and everyone else to eXecute this script", or in other words, make it runnable.
You can also change file ownership with chown and there are more clever ways to use chmod, but we leave that as an exercise to you.
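To watch chmod +x take effect, here is a small sketch that works inside a temporary folder. The file and variable names are made up for illustration.

```shell
work_dir=$(mktemp -d)
script="$work_dir/script.sh"
printf '#!/bin/sh\necho "it runs"\n' > "$script"

# -x asks: "am I allowed to execute this file?"
if [ -x "$script" ]; then before="executable"; else before="not executable"; fi

chmod +x "$script"

if [ -x "$script" ]; then after="executable"; else after="not executable"; fi

echo "before chmod: $before, after chmod: $after"
# before chmod: not executable, after chmod: executable

# ls -l now shows x in the permissions column, and the script runs.
ls -l "$script"
script_output=$("$script")
echo "$script_output"
# it runs
```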
There's also another way to execute shell scripts - instead of running them in a new process, you can also instruct your current shell process to read those script files line-by-line and run them as if you typed them in. This is done in one of 2 equivalent ways:
- source script.sh
- . script.sh (Note that it's a dot followed by a space!)
So if you ever see a mysterious dot in a shell script, know that this is what it does. This is often done to set variables in your current shell session - e.g. change the PATH for you. If you had run the script as a new process, it wouldn't be able to change YOUR variables.
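The difference between running a script as a new process and reading it into the current shell is easy to check. In this sketch, the settings file and the GREETING variable are made up for illustration.

```shell
# A tiny script whose only job is to set a variable.
settings=$(mktemp)
echo 'GREETING="set by the script"' > "$settings"

GREETING="original"

# Run as a new process: the variable changes there, not in our shell.
sh "$settings"
after_run=$GREETING
echo "$after_run"
# original

# Read into the current shell with the dot: now OUR variable changes.
. "$settings"
after_source=$GREETING
echo "$after_source"
# set by the script
```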
Shell scripts (and .py scripts, etc.) are not actually runnable programs - they're just text. They need an interpreter program to read them and decide what actual machine code to run on the CPU.
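So when you run script.py other args, what actually gets executed is something like python script.py other args.
The interpreter is not guessed from the .py suffix. It is named explicitly by a shebang: a first line that starts with #! followed by the path of the interpreter. If the shebang is missing, the shell will typically try to run the file as a shell script. So that's what it means when a script file starts with a single line like #!/bin/bash or #!/usr/bin/env python3.
In the second example, it explicitly instructs to use Python 3 and not 2.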
DO NOT DOWNLOAD AND RUN RANDOM FILES FROM THE INTERNET LIKE A SAVAGE
You will get hacked, not to mention an unmanageable mess.
Almost always, you can use the operating system distribution's package manager to download safe versions, which you can then easily upgrade at your convenience as new versions are released. Here is a list of what the common package managers are called:
- Ubuntu - apt. Also comes with good GUI wrappers.
- Fedora or Red Hat - yum (dnf on newer versions).
- Mac - brew. Also has an optional UI.
- Alpine - apk
- For Windows, yes, Windows - Chocolatey! Use it, it's awesome.
In case you didn't know, there are convenient ways to execute shell scripts from inside Jupyter notebooks. This is often useful when you want to do things like download files, create directories, clone Git repos, and many other scenarios.
For example - what if you try to use some deep learning library inside the notebook, but then you find out it won't work until you update GPU drivers or download a different version of Python? Or you need the opencv C libraries installed? These are a few possible examples out of an infinite possibility space, where knowing how to escape the bubble of the notebook can really make you more effective.
root user, sudo
Some actions, such as installing packages, will require root permissions. Root is the name of the superuser in Linux, which is allowed to do anything. Anything, including completely and irreversibly borking the operating system!
Usually, you can't (and shouldn't) log in as root - instead, you use sudo command to run that command as the root user. sudo means "superuser do" or "switch user do". Of course, it will only work if your user is allowed to sudo - this is called "being a sudoer". On a private computer, your normal user should be a sudoer.
If you want to do a lot of actions as the root user, you can open a whole shell session as the root user by running sudo -i.
Be careful not to shoot yourself in the foot! Mistakes made as root can permanently ruin your operating system.
Text editors in the terminal
Try to use nano if it exists - it's relatively intuitive and behaves like Notepad. But more systems come with vim included.
When using nano, just type what you want in the file, then press Ctrl+O to save or Ctrl+X to quit.
Instructions appear on the bottom, and as I mentioned earlier, ^ means Ctrl.
vim (or vi)
vim is the embodiment of an incredibly powerful tool that you can blow your own foot off with.
Most *nix systems will have it installed, so it's good to know how to use it in emergencies. If you know how to use it well, it can be a superpower and completely replace your IDE. It has separate modes for typing text, moving around, and typing commands. Fun!
For hours of amusement, Google image search: How to exit vim!
Tip: try using the command :syntax on to get nice syntax highlighting.
(Seriously though, vim is not that scary and very useful. Take a look at vim adventures!)
ps auxw shows the most useful information on the currently running processes.
Use top as a "task manager" to see what's slowing down your computer.
If a process is giving you trouble and you want to stop it, kill it!
- Find its process ID with ps auxw or top
- kill 123, where 123 is the process ID
- That sends a signal to the process called SIGTERM, which means "please shut yourself down cleanly"
- ctrl+c sends a closely related signal, SIGINT, which most programs also treat as a request to stop
- Of course, that doesn't always work
- If you give up on the process, use kill -9 123 to 100%, guaranteed, murder the process with ID 123. -9 means SIGKILL instead of SIGTERM.
Just want to quickly kill the python process? pkill python will find it (and ALL processes named python) by name. pkill -9 python will murder all of them.
Want to make sure no other python process is running before resorting to murder? pgrep python will find them for you. If it only returns one process, then murder away.
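You can practice on a harmless victim first. The sketch below starts a sleep command in the background and kills it. It relies on two things not covered above: $!, which holds the process ID of the most recent background command, and wait, which blocks until that process has exited. The variable names are made up.

```shell
# Start a long sleep in the background and remember its process ID.
sleep 300 &
sleep_pid=$!

# Politely ask it to stop (SIGTERM), then wait for it to actually exit.
kill "$sleep_pid"
exit_code=0
wait "$sleep_pid" || exit_code=$?

# By shell convention, a process killed by a signal reports 128 + the
# signal number. SIGTERM is signal 15.
echo "$exit_code"
# 143

# kill -0 sends no signal at all; it only checks whether the process exists.
if kill -0 "$sleep_pid" 2>/dev/null; then
  still_running="yes"
else
  still_running="no"
fi
echo "$still_running"
# no
```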
Where to find help
- Dr. Google!
- man - official manual pages for commands. For example, type man chmod to get help on the chmod command. Usually easier to read in a browser than inside the shell.
- tldr.sh - like man pages for humans. Highly recommended!
- https://github.com/nvbn/thefuck - automatically corrects mistakes, like typos or forgetting sudo
Other useful commands
- grep - find things inside files
- env - show all environment variables
- history shows you your command history
- Up and down arrows let you re-run commands from your history
- ctrl+r is a superpower! Lets you search in your previous commands instead of clicking the up arrow a million times to find a specific command you ran a long time ago
- df -h - see your free disk space
- du -h - see the disk size of current directory and all its children
- du -h -d 1 is a more friendly way to find and delete whatever is taking up space
- free - shows memory usage. Only look at total, used, available columns. The rest are not relevant.
- alias - Create your own power tools! Build a library of aliases in your shell startup file (for example .bash_profile), such as alias l="ls -lah"
- wc -l - count number of lines in a file or standard input
Very useful commands to know! They give you powers like the back & forward buttons of your browser, but for directories.
- curl and wget to download web pages or files from web servers.
- wget is actually better than Chrome and Firefox at downloading large files, and resuming the download if it failed in the middle.
- wget can be used easily to scrape sites, follow links, etc. when gathering data.
- curl is great for testing HTTP APIs.
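Several of these commands become far more useful once you combine them with pipes. As a small sketch, here is grep feeding wc -l to count matching lines. The sample log file and its contents are invented for illustration.

```shell
# Build a small sample log in a temporary file.
sample_log=$(mktemp)
printf 'error: disk full\ninfo: all good\nerror: out of memory\n' > "$sample_log"

# grep keeps only the lines containing "error", wc -l counts them.
# (tr strips the padding spaces that some versions of wc add.)
error_count=$(grep 'error' "$sample_log" | wc -l | tr -d ' ')
echo "$error_count"
# 2

# Print the matching lines themselves.
grep 'error' "$sample_log"
# error: disk full
# error: out of memory
```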
Running commands in the background
Running a command followed by & at the end makes it run in the background:
# Run these 2 commands in parallel.
# Each job gets its own log file, so they don't overwrite each other's output:
python long_training_job.py > log.txt &
python second_long_training_job.py > second_log.txt &
# Run other commands while the previous two commands are running.
# You'll get a notification in the terminal when they finish.
ctrl+z sends the currently running command to the background and pauses it.
bg then makes the command resume running in the background, while you can do other things.
jobs lists any commands you sent to the background:
sleep 123 &
sleep 456 &
sleep 789 &
sleep 999 &
jobs
# [1]    running    sleep 123
# [2]    running    sleep 456
# [3]  - running    sleep 789
# [4]  + running    sleep 999
kill %3 kills job number 3
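In a script, you often want to start several background jobs and then pause until all of them are done. One way to do that is the wait built-in, which is not covered above. In this sketch, the two jobs are made-up stand-ins that just sleep for a second, and the logs go to a temporary folder.

```shell
out_dir=$(mktemp -d)

# Hypothetical stand-ins for two slow jobs; each gets its own log file.
(sleep 1; echo "job A done") > "$out_dir/a.log" 2>&1 &
pid_a=$!
(sleep 1; echo "job B done") > "$out_dir/b.log" 2>&1 &
pid_b=$!

echo "both jobs are running in the background..."

# wait pauses the script until the given background processes have exited.
wait "$pid_a" "$pid_b"

cat "$out_dir/a.log" "$out_dir/b.log"
# job A done
# job B done
```

Since the two jobs run in parallel, the whole thing takes about one second rather than two.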
SSH is critical for real life work. It's mainly used to connect to a remote machine and start a shell session there, but it can do so much more!
It's even possible to render GUI programs through SSH: https://www.youtube.com/watch?v=tPrwFAswhE8
Good luck doing something like that on any other operating system!
tmux allows you to do split-screen terminals, to run several terminals and see their outputs at once.
Also, if you SSH into a server and then run tmux, tmux will keep running your commands on the server even when you disconnect! You can then reconnect to the machine, and reconnect to your tmux session, as if nothing happened!
SSH port forwarding
If a web server is running on the remote machine and its port is blocked by a firewall, but you have SSH access, then:
ssh -L 10000:localhost:20000 server.com
will expose the remote machine's port 20000 on your own machine's port 10000!
Hackers create chains like this to hide where they're originally coming from.
Think of symbolic links like shortcuts, except they work really transparently with shell commands and programs, instead of only working when you double click on them with a mouse. Really useful if you want something to look like it's copied somewhere else, without actually copying it. Can be critical for working with large data that you don't want to copy!
Remember - the order of arguments is like
ln -s original-file-or-folder where-you-want-the-link-to-be
It's safe to delete the link afterwards. That only deletes the link, not the original file:
rm -f where-you-want-the-link-to-be
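Here is the whole life cycle of a symbolic link as a small sketch, run inside a temporary folder. The file names and contents are made up for illustration.

```shell
data_dir=$(mktemp -d)
echo "pretend this is a huge dataset" > "$data_dir/original.txt"

# Arguments: first the existing file, then where the link should appear.
ln -s "$data_dir/original.txt" "$data_dir/link.txt"

# -L asks whether a path is a symbolic link.
if [ -L "$data_dir/link.txt" ]; then is_link="yes"; else is_link="no"; fi
echo "$is_link"
# yes

# Reading through the link gives the original's contents.
via_link=$(cat "$data_dir/link.txt")
echo "$via_link"
# pretend this is a huge dataset

# Deleting the link leaves the original untouched.
rm -f "$data_dir/link.txt"
if [ -f "$data_dir/original.txt" ]; then original_state="still there"; else original_state="gone"; fi
echo "$original_state"
# still there
```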
Zsh + Oh-my-zsh + powerlevel10K
My personally recommended setup for shells. It's beautiful, has many nifty and convenient features, and a suite of plugins.
- Zsh is just an alternative shell to bash.
- Oh-my-zsh is a framework for managing your Zsh configuration, which comes with a ton of cool plugins and features.
- Powerlevel10k is a fast and beautiful visual theme for Zsh.
Phew! This has been a hell of a ride. I hope I helped you get started on the right foot in your journey into Linux and working with the terminal. If you stay curious and practice, you can achieve levels of wizardry that non-terminal users have only heard of in legends. Have fun!