Set up environment and load data

This tutorial will guide you through the first steps in data science. You will learn how to set up a working environment in DataSpell, load a data set, and create a Jupyter notebook.

Prerequisites

Before you start, make sure that:

You have installed DataSpell. This tutorial was created in DataSpell 2023.3.
You can download DataSpell and use all its features for free during a 30-day trial period. Also, consider taking part in the DataSpell Early Access Program.
You have Python 3.6 or newer on your computer. If you're using macOS or Linux, Python may already be installed. Otherwise, you can get Python from python.org.
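
If you are not sure which Python version you have, a quick check like the one below may help. It is only a sketch: the helper name python_is_supported is made up for this example, and the minimum version is the one stated in the prerequisites above.

```python
import sys

# Minimum version taken from the prerequisites in this tutorial.
MIN_VERSION = (3, 6)


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the major.minor version meets the minimum (hypothetical helper)."""
    return tuple(version_info[:2]) >= minimum


# Report the version of the interpreter that runs this script.
print("Python", sys.version.split()[0], "supported:", python_is_supported())
```

Run it with the same interpreter you plan to select in DataSpell, because several Python versions can be installed side by side.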

Set up data science environment

Let's start with running DataSpell.

On Windows, find DataSpell in the Start menu or use the desktop shortcut. You can also run the launcher batch script or executable in the installation directory under bin.

On macOS, run the DataSpell app from the Applications directory, Launchpad, or Spotlight.

On Linux, run the dataspell.sh shell script in the installation directory under bin. You can also use the desktop shortcut if it was created during installation.

For more information about DataSpell initial setup, refer to Run DataSpell for the first time. When you are ready, click Launch DataSpell.

Create your first project

If you are on the Welcome screen, select the Projects tab and click New Project. If you already have a project open, choose File | New Project from the main menu.
Let's create a pandas Tables project.
Choose the project location. Click Browse in the Location field and specify the directory for your project.
Also, deselect the Create sample Jupyter notebook checkbox because you will create a new Jupyter notebook for this tutorial.
You can also configure a Python interpreter for your project. For more information, refer to Set up a working environment.
When you are ready, click Create.

If you already have a project open, DataSpell will ask you after you click Create whether to open the new project in the current window or in a new one. Choose Open in current window. This will close the current project, but you will be able to reopen it later. For more information, refer to Open, reopen, and close projects.

Prepare data

Now it is time to get some data for research. In this tutorial, we will use the "Airline Delays from 2003-2016" dataset by Priank Ravichandar, licensed under CC0 1.0. This dataset contains information on flight delays and cancellations at US airports for the period of 2003-2016.

We will load the data, analyze it, and find out which airport had the highest ratios of delayed and cancelled flights.

Add data to the project

Download the dataset from kaggle.com by using the Download link in the upper-right corner.
Extract airlines.csv from the archive and drag it to a folder in your DataSpell project directory.
Click Refactor to confirm the operation.
Now airlines.csv is shown in the Project tool window. You can open it in the editor.
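
If a later step cannot find the file, it may be worth confirming where airlines.csv ended up. The sketch below assumes the file sits directly in the directory you pass in; dataset_is_present is a hypothetical helper name, not part of DataSpell or pandas.

```python
from pathlib import Path


def dataset_is_present(project_dir, filename="airlines.csv"):
    """Return True if `filename` exists directly inside `project_dir` (hypothetical helper)."""
    return (Path(project_dir) / filename).is_file()


# "." is the current working directory; replace it with your project folder if needed.
print("airlines.csv found:", dataset_is_present("."))
```

A relative path such as 'airlines.csv' is resolved against the working directory of the running code, so keeping the dataset next to the notebook is the simplest arrangement.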

Let's create a Jupyter notebook:

Select the target directory in the Project tool window and do one of the following:
- Right-click the target directory and select New from the context menu.
- Press Alt+Insert.
Select Jupyter Notebook.
In the dialog that opens, type a filename.

A notebook document has the *.ipynb extension and is marked with the corresponding icon.

The newly created notebook contains one empty cell.

Now let's import the pandas library and load airlines.csv. Click inside the notebook cell and type the following code:

            import pandas as pd

            # Read file
            data = pd.read_csv('airlines.csv')
        

If there is a red squiggly line under pandas, it means that this package is not available in your environment:

Place the caret at the highlighted expression and press Alt+Enter to reveal the list of available quick fixes. Choose to install the package:

The required package will be installed, and the red squiggly line will no longer be displayed.
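
You can also check from code whether a package is importable in the current environment. This is a minimal sketch that uses the standard importlib module; is_installed is a hypothetical helper name introduced for the example.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment (hypothetical helper)."""
    return importlib.util.find_spec(package_name) is not None


# If this prints False, install the package with the quick-fix described above.
print("pandas available:", is_installed("pandas"))
```

The result depends on the interpreter that runs the code, so a package installed for one interpreter may still be reported as missing in another.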

If you see a warning about the Python interpreter, you can either click the first link in it to use the system interpreter, or configure a Python interpreter for your project (for more information, refer to "Creating a new virtual environment").

Depending on the options selected during initial setup, you may need to install Jupyter.

Let's run the notebook. There are several ways to do that:

To execute all code cells in your notebook, click Run All on the notebook toolbar.
To run just the current cell, press Ctrl+Enter.
When executing one cell at a time, mind code dependencies. If a cell relies on some code in another cell, that cell should be executed first.
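
The dependency between cells can be illustrated with a small sketch. The two cells are shown here as one script separated by comments, and the delay values are made up for the example.

```python
# Cell 1: defines a name that later cells depend on.
delays = [12, 0, 45, 7]  # hypothetical delay values in minutes

# Cell 2: relies on `delays` from Cell 1.
average_delay = sum(delays) / len(delays)
print(average_delay)
```

If you run only the second cell in a fresh kernel, Python raises a NameError because delays has not been defined yet.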

DataSpell allows viewing variables during notebook execution. To do that, open the Jupyter Variables tool window. Now there is a data variable that points to a pandas DataFrame structure:

Jupyter Variables tool window with data variable

To view the data as a table in a separate tab, click the View as DataFrame link.

You can also print out the data by typing the DataFrame name:

            import pandas as pd

            # Read file
            data = pd.read_csv('airlines.csv')

            # View DataFrame
            data
        

Press Ctrl+Enter to run the cell. The output is displayed under the code cell.

You can scroll the output cell. DataSpell will load and display the data dynamically.
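
Besides printing the whole DataFrame, a few standard pandas attributes and methods give a quicker overview. The sketch below uses a tiny in-memory table as a stand-in for airlines.csv; its column names and values are hypothetical and do not match the real dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for airlines.csv; the real file has different columns.
sample_csv = io.StringIO(
    "airport,year,delayed,cancelled\n"
    "ATL,2003,1200,45\n"
    "BOS,2003,800,30\n"
    "ORD,2003,1500,60\n"
)
sample = pd.read_csv(sample_csv)

print(sample.shape)    # number of rows and columns
print(sample.dtypes)   # column types inferred by pandas
print(sample.head(2))  # first two rows only
```

With the real dataset, the same calls on the data variable should work in the same way.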

Summary

Congratulations on completing this basic data science tutorial! Here's what you have done:

Created your first project
Downloaded a dataset and prepared it for research
Created a notebook and ran it for the first time

As a next step, learn to visualize data with matplotlib.

Last modified: 26 May 2024