Setting up macOS 10.15 Catalina for Data Science (part 1)

I’m of the opinion that if you’re really into Data Science, Linux should be your operating system. But whatever reasons you may have for using macOS instead, who am I to question them? Because both operating systems share UNIX roots, macOS, with some tweaks here and there, can be a very sweet middle ground between user friendliness and the power and simplicity of Linux when it comes to working with data.

This guide, divided in two parts, will take you through the steps I followed on a backup laptop I own when a group of fellow teachers recently asked me for a way to set up a native macOS environment as similar as possible to the one I had provided them on an Xubuntu virtual machine for a basic Python data analysis course.

Step 1: Xcode in its latest version

Most of the tools we will need are installed through Homebrew. Homebrew builds software from formulae, which are Ruby scripts written in Homebrew’s own domain-specific language. Because formulae may be compiled from source, the compiler that ships with Xcode is required.

Head to the Apple Developer page, log in with your Apple ID, then download and install Xcode; at the time of this writing the latest available was beta 3 of version 12. This is perhaps the most time-consuming step in this guide; for me the download was 10.2 GB, and on my system Xcode takes up about 27.7 GB.
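Once the install finishes, it may be worth confirming that the compiler toolchain is actually reachable from the terminal before moving on. The following is only a minimal sketch: report_tool is a helper name I made up for illustration, and the list of tools checked (xcode-select, clang, git) is my own assumption about what is useful to look for here.

```shell
# report_tool is a hypothetical helper: say whether a command is on PATH, and where.
report_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found at $(command -v "$1")"
  else
    echo "$1: not found"
    return 1
  fi
}

# Check the tools that come with Xcode (this list is an assumption, adjust as needed).
missing=0
for tool in xcode-select clang git; do
  report_tool "$tool" || missing=1
done

# xcode-select -p prints the active developer directory when Xcode is set up.
if command -v xcode-select >/dev/null 2>&1; then
  xcode-select -p || missing=1
fi

if [ "$missing" -eq 0 ]; then
  echo "toolchain looks ready for Homebrew"
else
  echo "something is missing; finish the Xcode install first"
fi
```

If anything is reported as not found, finish or repeat the Xcode installation before continuing with the next step.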

Step 2: Get ready for some brews!

Homebrew, known to many simply as brew, is a package manager for macOS, much in the style of apt, dnf or zypper in the Linux world. To install it, open the terminal and enter the following command:

% /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

If you are new to this tool and want to learn more than what we will use in this guide, a quick guide and a nice cheat sheet with the most used commands are available on this page.

Step 3: A clean, stable and unintrusive Python 3 install

There are several ways to get the latest version of Python onto our system; nevertheless, not planning ahead can eventually lead to a scenario so chaotic that it turns out to be funny (xkcd confirms this). There’s a reason I’m not fond of the Anaconda approach: it has made a mess of a couple of Linux systems in the past.

Moshe Zadka, author of several Python tutorials, has helped a significant number of developers work safely and consistently on macOS by following one core principle:

“The basic premise of all Python development is to never use the system Python. You do not want the Mac OS X ‘default Python’ to be ‘python3.’ You want to never care about default Python”

Moshe recommends using pyenv. This tool helps us manage different Python environments, each with the specific Python version it requires. The easiest way to install it is with Homebrew:

% brew install pyenv

Solely to find out which is the latest stable Python version available, we head to Python’s official download page. As of this writing, it is version 3.8.5. Again, to be clear, we are not downloading anything from there. We will use pyenv to install Python as follows:

% pyenv install 3.8.5
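As an alternative to checking the download page, pyenv itself can list the versions it knows how to build with pyenv install --list. The sketch below filters that list down to plain CPython 3.x.y releases and keeps the last one. The helper name latest_cpython3 is my own invention, and the sketch assumes the list is printed oldest first, so treat the result as a hint, not as the official answer.

```shell
# latest_cpython3 is a hypothetical helper: read a version list on stdin and
# print the last plain CPython 3.x.y entry (no betas, no -dev, no pypy, etc.).
latest_cpython3() {
  grep -E '^[[:space:]]*3\.[0-9]+\.[0-9]+[[:space:]]*$' | tail -n 1 | tr -d '[:space:]'
  echo   # tr stripped the trailing newline, so put one back
}

# Only query pyenv if it is installed.
if command -v pyenv >/dev/null 2>&1; then
  pyenv install --list | latest_cpython3
else
  echo "pyenv not found; install it first with: brew install pyenv"
fi
```

Whatever version this prints can be passed to pyenv install in place of 3.8.5.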

Now that we have installed the latest Python version through pyenv, we will set it as the default global version for our pyenv environments:

% pyenv global 3.8.5

Next, we can check this worked:

% pyenv version
3.8.5 (set by /Users/lperez/.pyenv/version)

Up to this point, if we run python in the shell, we still get this output:

WARNING: Python 2.7 is not recommended. 
This version is included in macOS for compatibility with legacy software. 
Future versions of macOS will not include Python 2.7. 
Instead, it is recommended that you transition to using 'python3' from within Terminal.

Python 2.7.16 (default, Jul  5 2020, 02:24:03) 
[GCC 4.2.1 Compatible Apple LLVM 11.0.3 (clang-1103.0.29.21) (-macos10.15-objc- on darwin

The power of pyenv comes from the control it has over our $PATH variable, which frees us from the pain of caring which is the “default” Python version on the system. To enable this, we must modify our zsh configuration file:

% brew install nano
% nano /Users/$USER/.zshrc

Whether or not the .zshrc file is empty, we must add the following at the end of its content:

## managing python through pyenv

if command -v pyenv 1>/dev/null 2>&1; then
  eval "$(pyenv init -)"
fi

We exit nano by pressing CTRL + X, confirm that the changes must be saved with Y, and finally press Enter.
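If you would rather not edit the file by hand, the same lines can be appended from the shell. This is only a sketch under my own assumptions: add_pyenv_init is a made-up helper name, and it uses the comment line as a marker so that running it twice does not add the block twice.

```shell
# add_pyenv_init is a hypothetical helper: append the pyenv init block to the
# config file given as its first argument, but only if it is not there yet.
add_pyenv_init() {
  rc_file="$1"
  marker='## managing python through pyenv'

  # Skip the append when the marker comment is already in the file.
  if [ -f "$rc_file" ] && grep -qF "$marker" "$rc_file"; then
    echo "pyenv block already present in $rc_file"
    return 0
  fi

  # Quoted heredoc delimiter: nothing inside is expanded while writing.
  cat >> "$rc_file" <<'EOF'

## managing python through pyenv
if command -v pyenv 1>/dev/null 2>&1; then
  eval "$(pyenv init -)"
fi
EOF
  echo "pyenv block added to $rc_file"
}

# Usage (commented out so nothing is modified by accident):
# add_pyenv_init "$HOME/.zshrc"
```

The guard matters because .zshrc is read by every new shell, and a duplicated block would run pyenv init twice on each start.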

To make this change effective, we run the following command:

% source /Users/$USER/.zshrc

Now let’s verify that whenever we call Python, the interpreter used will be the one we set as global for pyenv:

% python
Python 3.8.5 (default, Jul 24 2020, 17:29:19) 
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
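The mechanism behind this is plain $PATH precedence: the shell runs the first python it finds, and pyenv places its own directory of small launcher scripts (shims) ahead of the system directories. The sketch below imitates that idea with a throwaway directory and a stub script of my own making; the stub and the directory name are purely illustrative and are not real pyenv shims.

```shell
# Illustration only: imitate a pyenv-style shim with a throwaway directory.
demo_dir="$(mktemp -d "${TMPDIR:-/tmp}/shim-demo.XXXXXX")"

# A stub "python" that just announces itself (hypothetical stand-in for a shim).
cat > "$demo_dir/python" <<'EOF'
#!/bin/sh
echo "stub python from the demo directory"
EOF
chmod +x "$demo_dir/python"

# Prepending the directory to PATH makes the shell find the stub first.
old_path="$PATH"
PATH="$demo_dir:$PATH"
resolved="$(command -v python)"
echo "python now resolves to: $resolved"
stub_output="$(python)"
echo "$stub_output"

# Restore the original PATH and clean up.
PATH="$old_path"
rm -rf "$demo_dir"
```

On a real setup, running command -v python after sourcing .zshrc should point into the pyenv directory instead of /usr/bin.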

Et voilà! With the help of pyenv we have a clean, nice and secure installation of Python. As you see fit, you can use pyenv to manage different Python versions and dependencies for your projects.

Credit: The steps in this section are a summary of the excellent article by Matthew Broberg, available on opensource.com

Step 4: pipenv and some of the most important libraries for Data Science

Python doesn’t come out of the box with everything you may need. Again, I’m not a fan of the Anaconda approach of installing a BIG bunch of libraries just in case you need them some day. I also think that installing a library only when the need arises helps you gain insight into the inner workings of Python, and therefore boosts your confidence while working with it.

Pip is the standard package manager for Python, and it lets you easily install most likely any library you will ever need. With pipenv we are dodging pretty much the same issue as with pyenv: to avoid potential headaches, we don’t want to mix up our system Python libraries with the ones we want to install.

As you may have guessed, to install pipenv we issue the following command:

% brew install pipenv

If you already have a pyenv environment active, as we do because of the last step, pipenv will detect this and install your libraries within this environment. Therefore, where in the past you may have used something like:

% pip install --user <package_name>

Now, you will simply use:

% pipenv install <package_name>

These are my commonly used libraries, but most likely you will find the need for several more down the road:

% pipenv install numpy scipy pandas jupyter jsonschema csvkit pg8000 psycopg2-binary sqlalchemy matplotlib seaborn plotly beautifulsoup4 html5lib lxml requests pandoc pillow xlrd xlwt
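Pipenv is usually run from inside a project directory, where it records what you install in a Pipfile. As a rough sketch of that workflow, the helper below creates a directory, asks pipenv for the Python version installed earlier, installs one library and checks that it imports. The name new_ds_project and the example directory are hypothetical, and pandas is just a stand-in for whatever you need first.

```shell
# new_ds_project is a hypothetical helper: set up a small pipenv project
# in the directory given as its first argument.
new_ds_project() {
  project_dir="$1"

  # Bail out early if pipenv is not installed.
  if ! command -v pipenv >/dev/null 2>&1; then
    echo "pipenv not found; install it first with: brew install pipenv" >&2
    return 1
  fi

  mkdir -p "$project_dir" && cd "$project_dir" || return 1

  # Create the environment with the pyenv-installed interpreter,
  # add one library, then confirm it imports inside that environment.
  pipenv --python 3.8.5 &&
    pipenv install pandas &&
    pipenv run python -c 'import pandas; print(pandas.__version__)'
}

# Usage (commented out so nothing is installed by accident):
# new_ds_project "$HOME/ds-sandbox"
```

From inside the project directory, pipenv shell opens a shell with that environment active, and pipenv run prefixes a single command with it.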

Closing thoughts

In this part of the guide we basically set up Python (in a proper and clean way) on our system as the foundation of our Data Science working environment. In the upcoming second part of this guide we will cover other tools that, in my personal experience, are essential when working with data.

Published in Guides