Prepping for Work

Welcome to this short introduction to the world of Data Analysis with Python! Let’s not waste any time; let’s get set up so we can get right to work.

Before we write any code or do any computer setup, let’s make sure we’ve got a couple of things:

  • A text editor we trust (I use Sublime Text 3, but Atom is also good)
  • Python installed on our machines
  • Python’s pip command available for use

Got Python?

To check whether you actually have access to Python, open a new shell in Terminal and type the following at the command line:

$ which python

Your shell should return something that looks like one of these:

/usr/bin/python
/Library/Frameworks/Python.framework/Versions/2.7/bin/python
/Users/your_name_here/some/path/to/python
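If which does find an interpreter, you can also confirm its version from inside Python itself. A quick sketch:

```python
import sys

# sys.version_info is a tuple-like object: (major, minor, micro, ...)
print(sys.version_info[:3])

# Tuple comparison makes a handy minimum-version guard in scripts:
assert sys.version_info >= (2, 7), "Python 2.7 or newer required"
```
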

If instead that command returns nothing, install Python as described below for your operating system.

For OSX

You can use the Homebrew package manager to install Python or update to the most recent version. If you don’t have Homebrew, follow the link to download and install it on your machine. Then you can obtain Python 2 and 3 with the following:

$ brew install python

$ brew install python3

For Linux

On Linux machines you can use the apt-get command to retrieve and install a new version of Python or to update to the latest version.

$ sudo apt-get update

$ sudo apt-get install build-essential python3-dev python3-venv python python-dev

Similarly, check whether your distribution of Python includes the pip command. To see that it’s available to you, type

$ which pip

That’ll be useful shortly, when we install all of the data-science-related Python packages we may need. If you don’t have pip, download the get-pip.py installer from here. Then, from wherever you downloaded that get-pip.py file, run the following:

$ python get-pip.py

A New Virtual Environment

As a developer, especially as a Python developer, one often finds that different projects require different libraries. Even as an analyst you may want to use different versions of Python packages for different projects to reproduce someone else’s results.

To keep our different Python worlds separate, we use virtual environments. A virtual environment isolates your current project from the rest of your computer. It’s great for when you want to set environment variables, install packages for just one project, and so on.
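You can even check from inside Python whether you’re currently in a virtual environment. A minimal sketch, using the sys.prefix / sys.base_prefix distinction that Python 3’s venv module sets up:

```python
import sys

def in_virtualenv():
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the interpreter the venv
    # was created from; outside a venv the two are the same.
    base = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base

print(in_virtualenv())
```

Running this before and after activating an environment is a quick way to confirm the activation actually took.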

Installing and Activating Virtual Environments

With Python 3, it’s simple to create a new virtual environment. Navigate to the directory where you want to work, and type the following:

$ python3 -m venv .

This command sets up a new virtual environment wherever you currently are in your file system. You can also have it create a new directory and set up the environment inside that directory, like so:

$ python3 -m venv data_analysis

That one command both creates the data_analysis directory and sets up a virtual environment within it.

To activate your new virtual environment, navigate to the project directory (if you’re not there already) and type

$ source bin/activate

This changes where your terminal looks for things like environment variables, Python itself, and Python packages. Your command-line prompt should also change a bit, depending on how your system’s .shellrc or .shell_profile file is set up. Mine prepends the new environment name to the prompt:

(data_analysis)$

Install the Packages

For tonight’s work, we’ll need a variety of Python packages. Instead of having you hunt for them, I’ve provided those packages and their dependencies in the form of a requirements.pip file. You can download that file here: link.

Now, use the wonder of Python to obtain all of those packages.

(data_analysis)$ pip install -r data_science_requirements.pip
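Once pip finishes, a quick sanity check is to import each package and print its version. This sketch uses importlib from the standard library; the package names shown are placeholders, since they depend on what your requirements.pip actually lists:

```python
import importlib

# Substitute the packages from your own requirements file here;
# these names are just illustrative placeholders.
for name in ["numpy", "pandas"]:
    try:
        module = importlib.import_module(name)
        print(name, getattr(module, "__version__", "unknown"))
    except ImportError:
        print(name, "is not installed")
```
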

Create the Notebook

Tonight’s work will be performed in the Jupyter Notebook ecosystem. Jupyter Notebook provides a great environment for interspersing code and thought processes. It’s also great for experimenting with code and seeing how changes in certain inputs affect the script.

We start Jupyter Notebook from the command line.

(data_analysis)$ jupyter notebook
[I 17:48:46.769 NotebookApp] Serving notebooks from local directory: /Users/Nick/Documents/codefellows/courses/data_analysis_glance/code
[I 17:48:46.769 NotebookApp] 0 active kernels
[I 17:48:46.769 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/
[I 17:48:46.769 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

Your terminal will return the above messages, then pop open a new tab in your browser at port 8888.

Within the tab you’ll see a new interface with its own menu showing:

  • Files
  • Running
  • Clusters

We’ll just be concerned with what’s in the Files tab for now, as that is where our notebook will appear once we’ve created it.

To actually create a notebook, click New and choose Python 3 (or Python 2).

Using Jupyter Notebook

Once the notebook opens, you’ll see what looks like an IPython prompt, as well as a bunch of other stuff:

  • The title of the notebook (currently Untitled)
  • The File/Edit/View/Insert/Cell/Kernel/Help menu
  • A sub-menu for copying, pasting, saving, and navigating the notebook
  • A dropdown menu (currently titled Code) that lets you change how a cell’s contents are interpreted and rendered

In your terminal you’ll see that your notebook saves itself about every two minutes. For now it’ll be saved as Untitled.ipynb. The file name changes with the notebook’s title (e.g. changing the title to “Flerg the Blerg” changes the filename to Flerg the Blerg.ipynb).

When you open a new notebook, you start off with an empty cell. Within this cell you write code as you would at the command line, and it automatically gets syntax highlighting. You execute the code within a cell by holding Shift and pressing Enter.

In [1]: print("Hello World")
Hello World

If executed code prints to stdout, that output appears below the cell. You can of course write multiple lines of code, because otherwise it’d just be silly.
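For example, a single cell can hold a whole block of code, and only the printed output appears beneath it:

```python
# A multi-line cell behaves like a short script:
total = 0
for n in range(1, 5):
    total += n
print("sum of 1..4 is", total)  # prints: sum of 1..4 is 10
```
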

You can write code in your notebook as you would in any script file. When you write code blocks, the notebook auto-indents for you. Some of the same keyboard commands for comments and indentation in Sublime (or Atom) work here too.

If you need to check the documentation of an object or method, type the following:

In [2]: object_or_method?

Jupyter Notebook will pop up a mini-window from the bottom of your screen containing the top-level documentation for that object or method. You can also get more detailed documentation by typing

In [3]: help(object_or_method)

Jupyter will print the detailed documentation for you below that cell in a scrollable field.
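The ? suffix is a Jupyter/IPython convenience, while help() works in any Python session. A quick sketch using a built-in:

```python
# help() prints the full documentation for any object;
# the underlying text lives on the object's __doc__ attribute.
help(len)
print(type(len.__doc__))  # the same text is available as a plain string
```
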

One important difference between Jupyter Notebook and the IPython interpreter is that the order in which cells are executed can change, and that order is extremely important. Look at the numbers in brackets on the left side: the higher the number, the more recently that cell was executed. Changing and re-running an earlier cell updates its number, but for that change to propagate through the rest of your code, you have to re-run the cells that follow it. We’ll see that later. This lets you experiment with your code without re-running an entire script, but it can be dangerous if you lose track of how your code is working.
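Here’s a small illustration of the stale-state problem; treat each commented section below as its own cell:

```python
# In [1]: define x
x = 10

# In [2]: compute y from x
y = x * 2
print(y)  # prints: 20

# Now suppose you edit the first cell to x = 5 and re-run ONLY it:
x = 5
# y was never recomputed, so it's stale:
print(y)  # prints: 20, not 10, until the second cell is re-run
```
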

Because these notebooks were modeled after how scientists write and think in their own notebooks, you can write plain text and/or Markdown in cells amongst your code. This is done via the dropdown menu at the top, currently entitled Code. Select a cell, click on that menu, and change the cell’s format to Markdown. Then you can write Markdown as you please, and it renders when you execute the cell. This is handy for writing down thoughts as you try out code.

We’ll be using it to write whatever the hell we want moving forward.

Next Up: The Data