DataSpell 2024.1 Help

Visualize data with matplotlib

In this tutorial, you will learn to visualize data by using the Matplotlib library. While learning, you will perform the following tasks:

  • Find out which US airport had the highest ratios of delayed and cancelled flights in 2003-2016.

  • See how those ratios changed over time.

  • Highlight some general trends.

Prerequisites

Before you start, make sure that:

  • You have installed DataSpell. This tutorial was created in DataSpell 2023.3.

  • You have Python 3.6 or newer on your computer. If you're using macOS or Linux, your computer may already have Python installed. Otherwise, you can get Python from python.org.

Prepare an example

If you have completed the previous tutorial, just proceed to Transform data. Otherwise, download the data set and add it to the DataSpell workspace as described in the section Add data to the project.

In this tutorial, we will use the "Airline Delays from 2003-2016" dataset by Priank Ravichandar licensed under CC0 1.0. This dataset contains information on flight delays and cancellations at US airports for the period of 2003-2016.

You should have a notebook with the following cell:

import pandas as pd

# Read file
data = pd.read_csv('airlines.csv')

You can also download the full notebook and add it to your DataSpell workspace: https://www.jetbrains.com/help/dataspell/resources/ds_visualize_tutorial.ipynb

Transform data

Our first goal is to find the airport with the highest ratios of delayed and cancelled flights. For that purpose we need only the following columns:

  • Airport.Code

  • Statistics.Flights.Total (total number of flights)

  • Statistics.Flights.Delayed (number of delayed flights)

  • Statistics.Flights.Cancelled (number of cancelled flights)

Let's select the required columns from the data DataFrame and put them into airport_data by adding the following code to the first cell:

# Select required columns
airport_data = data[['Airport.Code', 'Statistics.Flights.Total', 'Statistics.Flights.Delayed', 'Statistics.Flights.Cancelled']]

# View DataFrame
airport_data

Note that DataSpell provides automatic completion for column headers:

Auto completion of column headers names

This is what you'll see in the output after you run the cell by pressing Ctrl+Enter or clicking Run all:

The output of the newly created DataFrame

Scroll down through the rows. You can see that each row contains data for one month at one specific airport. We need statistics for the whole period, so let's group the data by airport code and sum the values:

# Group data by airport code and summarize the values
airport_data = airport_data.groupby('Airport.Code').sum()

Here is the full code of the cell:

import pandas as pd

# Read file
data = pd.read_csv('airlines.csv')

# Select required columns
airport_data = data[['Airport.Code', 'Statistics.Flights.Total', 'Statistics.Flights.Delayed', 'Statistics.Flights.Cancelled']]

# Group data by airport code and summarize the values
airport_data = airport_data.groupby('Airport.Code').sum()

# View DataFrame
airport_data

... and its output:

The output of grouped and summarized DataFrame
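To see what the grouping step does without loading the whole dataset, here is a minimal sketch on a made-up DataFrame. The airport codes AAA and BBB and all the numbers are invented for illustration; only the column names match the tutorial.

```python
import pandas as pd

# Made-up monthly records for two hypothetical airports (not taken from airlines.csv)
toy = pd.DataFrame({
    'Airport.Code': ['AAA', 'AAA', 'BBB', 'BBB'],
    'Statistics.Flights.Total': [100, 200, 50, 150],
    'Statistics.Flights.Delayed': [10, 30, 5, 25],
    'Statistics.Flights.Cancelled': [1, 3, 0, 2],
})

# One row per airport; each numeric column holds the sum over all months
toy_totals = toy.groupby('Airport.Code').sum()
print(toy_totals)
```

The four monthly rows collapse into two rows, one per airport code, and the airport code moves from a regular column into the index.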

Before we continue, let's create another code cell. Click Add Code Cell Below in the notebook toolbar:

Adding a code cell below

We will compare the airports by the ratios of delayed and cancelled flights. Looks like we should add some data to the DataFrame. Let's create two new columns: Ratio Delayed.Total and Ratio Cancelled.Total. The data in those columns will result from calculations. The variables are introduced for code readability:

# Assign variables
delayed = airport_data['Statistics.Flights.Delayed']
cancelled = airport_data['Statistics.Flights.Cancelled']
total = airport_data['Statistics.Flights.Total']

# Create new columns and fill them with values
airport_data['Ratio Delayed.Total'] = delayed / total
airport_data['Ratio Cancelled.Total'] = cancelled / total

# View DataFrame
airport_data

Run the cell to make sure that two columns have been added to the DataFrame:

Two additional columns in the DataFrame
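Dividing one column by another works row by row: pandas matches the values by index label, so each airport's delays are divided by that same airport's total. A minimal sketch with made-up totals for two hypothetical airports:

```python
import pandas as pd

# Made-up per-airport totals (hypothetical codes and numbers)
toy_totals = pd.DataFrame(
    {
        'Statistics.Flights.Total': [300, 200],
        'Statistics.Flights.Delayed': [40, 30],
        'Statistics.Flights.Cancelled': [4, 2],
    },
    index=pd.Index(['AAA', 'BBB'], name='Airport.Code'),
)

total = toy_totals['Statistics.Flights.Total']

# Element-wise division: every row gets its own ratio
toy_totals['Ratio Delayed.Total'] = toy_totals['Statistics.Flights.Delayed'] / total
toy_totals['Ratio Cancelled.Total'] = toy_totals['Statistics.Flights.Cancelled'] / total
print(toy_totals[['Ratio Delayed.Total', 'Ratio Cancelled.Total']])
```

In this toy data, BBB has fewer delayed flights than AAA in absolute numbers, but a higher ratio of delays, which is why the tutorial compares ratios rather than raw counts.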

Visualize data

Let's start with visualizing our newly created airport_data DataFrame in the form of a bar chart.

You need to add import matplotlib.pyplot as plt to the beginning of your first code cell. Don't forget to rerun that cell after editing, for example, by clicking Run cell and select below or pressing Shift+Enter.

Create a bar chart

  1. First of all, let's improve the readability of the future chart and sort the rows in airport_data by using the sort_values method:

    airport_data = airport_data.sort_values(by='Ratio Delayed.Total', ascending=False)
  2. Now assign the variables:

    airport_code = airport_data.index
    delayed = airport_data['Ratio Delayed.Total']
    cancelled = airport_data['Ratio Cancelled.Total']

    Note that when we grouped the airport_data DataFrame, the Airport.Code column became the index column. It means that airport codes are used as row labels in this DataFrame, and you don't need to specify a column name to read them.

  3. The following code creates a figure with the specific width and height in inches, as well as a Matplotlib Axes. They will contain all the elements of the future bar chart.

    fig, ax = plt.subplots(figsize=(15,5))
  4. To plot a bar chart, use the bar() method. It accepts the x and y coordinates as the first two positional arguments. The label will be used to render the chart legend.

    ax.bar(airport_code, delayed[airport_code], bottom=cancelled[airport_code], label='Delayed')
    ax.bar(airport_code, cancelled[airport_code], label='cancelled')

    When plotting the first bar chart, we use an additional bottom parameter to stack the delays bars on top of the cancellations bars.

  5. Set the labels for axes, the chart title, and show the legend:

    ax.set_xlabel('Airport codes')
    ax.set_ylabel('Ratio')
    ax.set_title('Ratio of delayed and cancelled flights to total flights')
    ax.legend()
  6. Finally, use plt.show() to render the whole thing. This is the full code of the cell:

    # Sort DataFrame rows
    airport_data = airport_data.sort_values(by='Ratio Delayed.Total', ascending=False)

    # Assign variables
    airport_code = airport_data.index
    delayed = airport_data['Ratio Delayed.Total']
    cancelled = airport_data['Ratio Cancelled.Total']

    # Create a figure and set its size to 15x5 in.
    fig, ax = plt.subplots(figsize=(15,5))

    # Plot bar charts
    ax.bar(airport_code, delayed[airport_code], bottom=cancelled[airport_code], label='Delayed')
    ax.bar(airport_code, cancelled[airport_code], label='cancelled')

    # Add axes labels and title
    ax.set_xlabel('Airport codes')
    ax.set_ylabel('Ratio')
    ax.set_title('Ratio of delayed and cancelled flights to total flights')

    # Show legend
    ax.legend()

    # Show plot
    plt.show()

    And here is the result:

    Bar chart with delays and cancellations ratios
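The stacking done with the bottom parameter can be checked on a tiny made-up example. The sketch below uses three hypothetical airport codes and invented ratios, and it selects the non-interactive Agg backend only so that it runs as a plain script without a display; in a notebook you would drop that line and call plt.show() as in the tutorial.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: run as a script without a display; not needed in a notebook
import matplotlib.pyplot as plt

# Made-up ratios for three hypothetical airports
codes = ['AAA', 'BBB', 'CCC']
delayed_ratio = [0.25, 0.20, 0.15]
cancelled_ratio = [0.03, 0.02, 0.01]

fig, ax = plt.subplots(figsize=(6, 3))

# Lower layer starts at zero; upper layer starts where the lower one ends
lower = ax.bar(codes, cancelled_ratio, label='Cancelled')
upper = ax.bar(codes, delayed_ratio, bottom=cancelled_ratio, label='Delayed')
ax.legend()

# Compare the top of each lower bar with the base of the bar stacked on it
for low, up in zip(lower, upper):
    print(low.get_height(), up.get_y())
```

Each printed pair contains two equal numbers: the height of a cancellation bar is exactly the base of the delay bar above it, so the total height of a stack is the sum of both ratios.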

You can see that the highest ratio of delayed flights was at Newark Liberty International Airport (EWR). Let's continue researching the data for this particular airport. It would be interesting to find out how the numbers of cancelled and delayed flights changed over time.

Create a line chart

  1. Let's start with selecting the necessary data from the dataset:

    ewr_data = data[data['Airport.Code']=='EWR']

    This code can be translated as "select the rows from data that have EWR in the Airport.Code column and put them into the ewr_data DataFrame".

  2. Assign the variables:

    date = ewr_data['Time.Label']
    delayed = ewr_data['Statistics.Flights.Delayed']
    cancelled = ewr_data['Statistics.Flights.Cancelled']
  3. Again, create a figure and an Axes:

    fig, ax = plt.subplots(figsize=(15,5))
  4. For line charts, use the plot() method:

    ax.plot(date, delayed, label='Delays')
    ax.plot(date, cancelled, label='Cancellations')
  5. Add the axes labels, the title, and the legend:

    ax.set_xlabel('Year/Month')
    ax.set_ylabel('Flights (delayed/cancelled)')
    ax.set_title('Cancellations and delays in EWR 2003-2016')
    ax.legend()
  6. Here is the full code cell for copy-pasting:

    # Select rows with 'EWR' in the first column
    ewr_data = data[data['Airport.Code']=='EWR']

    # Assign variables
    date = ewr_data['Time.Label']
    delayed = ewr_data['Statistics.Flights.Delayed']
    cancelled = ewr_data['Statistics.Flights.Cancelled']

    # Create a figure and set its size to 15x5 in.
    fig, ax = plt.subplots(figsize=(15,5))

    # Plot line charts
    ax.plot(date, delayed, label='Delays')
    ax.plot(date, cancelled, label='Cancellations')

    # Add axes labels
    ax.set_xlabel('Year/Month')
    ax.set_ylabel('Flights (delayed/cancelled)')

    # Add title and legend
    ax.set_title('Cancellations and delays in EWR 2003-2016')
    ax.legend()

    # Show plot
    plt.show()

    And the output:

    The line chart of delays and cancellations in EWR

    There are too many ticks on the x-axis. To show only every 12th of them, put the following line before plt.show():

    ax.set_xticks(date[::12])

    It looks better now:

    The line chart of delays and cancellations in EWR
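Two pandas idioms do most of the work in the steps above: selecting rows with a boolean mask and thinning the labels with a step slice. Here is a minimal sketch of both on made-up data; the airport codes, dates, and numbers are invented for illustration.

```python
import pandas as pd

# Made-up records for two hypothetical airports
toy = pd.DataFrame({
    'Airport.Code': ['AAA', 'BBB', 'AAA', 'BBB'],
    'Time.Label': ['2003/06', '2003/06', '2003/07', '2003/07'],
    'Statistics.Flights.Delayed': [10, 5, 30, 25],
})

# The comparison yields a Series of True/False values, one per row
mask = toy['Airport.Code'] == 'AAA'

# Indexing with the mask keeps only the rows where it is True
aaa_data = toy[mask]
print(aaa_data)

# Three years of made-up monthly labels in YYYY/MM format
labels = pd.Series([f'{2003 + i // 12}/{i % 12 + 1:02d}' for i in range(36)])

# A slice with a step of 12 keeps every 12th label, that is, one per year
yearly_ticks = labels[::12]
print(list(yearly_ticks))
```

The same slice is what date[::12] passes to set_xticks() in the tutorial: of all the monthly labels, only one per year is kept as a tick.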

As you can see, the number of cancellations didn't change much over time. But there is a clearly visible fall in the number of delayed flights somewhere around the end of 2009. Did they hire a new manager? Or maybe that's somehow connected with the total number of flights? Let's check!

Show more data

  1. We will add another line chart with the total number of flights. But those numbers are far greater than the numbers of delays. The solution is to add another Axes that will share the same x-axis but have its own y-axis. That can be done by using the twinx() method:

    total = ewr_data['Statistics.Flights.Total']
    ax2 = ax.twinx()
    ax2.plot(date, total, '--', color='g', label='Total flights')
    ax2.set_ylabel('Flights (total)')
    ax2.legend(loc='upper center')

    Note the third positional argument of the plot() method. Dashes will result in a dashed line. The color is also customizable. We also specified the location of the legend, so that it doesn't overlap the other one.

  2. The full code:

    # Select rows with 'EWR' in the first column
    ewr_data = data[data['Airport.Code']=='EWR']

    # Assign variables
    date = ewr_data['Time.Label']
    delayed = ewr_data['Statistics.Flights.Delayed']
    cancelled = ewr_data['Statistics.Flights.Cancelled']
    total = ewr_data['Statistics.Flights.Total']

    # Create a figure and set its size to 15x5 in.
    fig, ax = plt.subplots(figsize=(15,5))

    # Plot line charts
    ax.plot(date, delayed, label='Delays')
    ax.plot(date, cancelled, label='Cancellations')

    # Add axes labels
    ax.set_xlabel('Year/Month')
    ax.set_ylabel('Flights (delayed/cancelled)')

    # Add title and legend
    ax.set_title('Cancellations and delays in EWR 2003-2016')
    ax.legend()

    # Decrease the density of ticks on x-axis
    ax.set_xticks(date[::12])

    # Plot another chart with extra y-axis
    ax2 = ax.twinx()
    ax2.plot(date, total, '--', color='g', label='Total flights')
    ax2.set_ylabel('Flights (total)')

    # Add legend in center
    ax2.legend(loc='upper center')

    # Show plot
    plt.show()

    And the chart:

    The line chart of delays, cancellation, and total number of flights
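The tutorial keeps two legends, one per Axes. As a possible alternative, the entries of both Axes can be collected and shown in a single legend. The sketch below is not part of the original notebook; it uses made-up numbers and selects the Agg backend only so that it runs as a plain script without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: run as a script without a display; not needed in a notebook
import matplotlib.pyplot as plt

# Made-up monthly values on very different scales
months = ['2003/01', '2003/02', '2003/03', '2003/04']
delayed = [30, 25, 35, 20]
total = [1000, 900, 1100, 950]

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, delayed, label='Delays')
ax.set_ylabel('Flights (delayed)')

# Second Axes: same x-axis, independent y-axis on the right
ax2 = ax.twinx()
ax2.plot(months, total, '--', color='g', label='Total flights')
ax2.set_ylabel('Flights (total)')

# Collect the legend entries of both Axes and draw one combined legend
handles1, labels1 = ax.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(handles1 + handles2, labels1 + labels2, loc='upper center')
```

With a single legend there is nothing to position around, at the cost of a few extra lines. Whether that reads better than two separate legends is a matter of taste.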

So, there is no magic. The overall number of flights decreased significantly in the middle of 2008, which naturally resulted in fewer delays.

Keep researching

What happened in 2008? Did the number of flights decrease only in EWR? Let's find out by analyzing the whole dataset.

Now we are interested only in dates and numbers. Let's just group the rows of the data DataFrame by the time label in YYYY/MM format, and then summarize the values. Some columns will contain meaningless values (for example, the sums of Time.Month and Time.Year), but we won't use them. Everything else is similar to the previous tutorial steps:

# Group data by date and summarize the values
monthly_data = data.groupby('Time.Label').sum()

# Assign variables
date = monthly_data.index  # Date
total = monthly_data['Statistics.Flights.Total']  # Total number of flights

# Create a figure and set its size to 15x5 in.
fig, ax = plt.subplots(figsize=(15,5))

# Plot a line chart
ax.plot(date, total)

# Decrease the density of ticks on x-axis
ax.set_xticks(date[::12])

# Add title and axes labels
ax.set_title('Total number of flights in US airports 2003-2016')
ax.set_xlabel('Year/Month')
ax.set_ylabel('Flights')

# Show plot
plt.show()

And yes, there was a drop in total flights in all US airports:

Total number of flights in US airports 2003-2016
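If the meaningless sums in the unused columns bother you, one option is to select the column of interest before summing. This is a sketch of an alternative, not the tutorial's own code, and it runs on a made-up DataFrame with invented numbers:

```python
import pandas as pd

# Made-up records: two hypothetical airports, two months
toy = pd.DataFrame({
    'Time.Label': ['2003/06', '2003/06', '2003/07', '2003/07'],
    'Time.Year': [2003, 2003, 2003, 2003],
    'Statistics.Flights.Total': [100, 50, 200, 150],
})

# Summing everything also adds up Time.Year, which is meaningless (2003 + 2003 = 4006)
summed_all = toy.groupby('Time.Label').sum()
print(summed_all)

# Selecting the column first sums only the values we plan to plot
monthly_total = toy.groupby('Time.Label')['Statistics.Flights.Total'].sum()
print(monthly_total)
```

The result of the second call is a Series indexed by the time label, so monthly_total.index and monthly_total itself can be passed to ax.plot() directly.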

Speaking of drops, do you see those periodic dips on the chart? They seem to repeat every year. Let's build a chart to see how the total number of flights changes throughout the year.

This time we will group the data by month name. The sort_values(by='Time.Month') part is needed to arrange the months in chronological order:

# Group data by month, summarize values, and place months in correct order
seasonal_data = data.groupby('Time.Month Name').sum().sort_values(by='Time.Month')

# Assign variables
month = seasonal_data.index
total = seasonal_data['Statistics.Flights.Total']

# Create a figure and set its size to 15x5 in.
fig, ax = plt.subplots(figsize=(15,5))

# Plot a line chart
ax.plot(month, total)

# Add title and axes labels
ax.set_title('Total number of flights in US airports throughout the year')
ax.set_xlabel('Month')
ax.set_ylabel('Flights')

# Disable scientific notation for y-axis
ax.ticklabel_format(axis='y', style='plain')

# Show plot
plt.show()

Definitely, February is the low season for US civil aviation:

Total number of flights in US airports throughout the year
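Why is the extra sorting step needed? After grouping, pandas orders the groups by their key, so month names come out alphabetically. The sketch below shows this on made-up data covering three months over two hypothetical years. Sorting by the summed Time.Month column restores the calendar order because every month occurs equally often here, so the sums still grow from January onwards.

```python
import pandas as pd

# Made-up records: three months, each occurring twice
toy = pd.DataFrame({
    'Time.Month Name': ['January', 'February', 'March'] * 2,
    'Time.Month': [1, 2, 3] * 2,
    'Statistics.Flights.Total': [100, 80, 110, 120, 90, 130],
})

# Groups are ordered by key, that is, alphabetically by month name
alphabetical = toy.groupby('Time.Month Name').sum()
print(list(alphabetical.index))

# Sorting by the summed month numbers (2, 4, 6) restores the calendar order
chronological = alphabetical.sort_values(by='Time.Month')
print(list(chronological.index))
```

Even in this toy data, February has the lowest total, but that is only because the numbers were invented that way.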

Summary

You have completed the matplotlib visualization tutorial. Here's what you have done:

  • Selected necessary data and transformed it

  • Built bar charts and line charts

  • Discovered statistical trends

Last modified: 29 March 2024