PA 2.5: Data Framework


CEGM1000 MUDE: Week 2.5. Due: complete this PA prior to class on Friday, Dec 13, 2024.

Overview of Assignment

This assignment gives a quick introduction to the pandas package. We use only a few of its features here, so that you are familiar with it before relying on it more in the coming weeks. Its primary purpose for us is to load data from csv files and process the contents quickly, which is done with a data type specific to pandas: the DataFrame. pandas also makes it very easy to export data to a *.csv file.

If you want to learn more about pandas after finishing this assignment, the Getting Started page is a great resource.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Part 1: Introduction to pandas

Pandas dataframes are considered by some to be difficult to use. For example, here is a line of code from one of our notebooks this week. Can you understand what it is doing?

net_data.loc[net_data['capacity'] <= 0, 'capacity'] = 0

One of the reasons for this is that the primary pandas data type, a DataFrame object, uses a dictionary-like syntax to access and store elements. For example, remember that a dictionary is defined using curly braces.

In [2]:
my_dict = {}
type(my_dict)
Out[2]:
dict

Also remember that you can add items as a key-value pair:

In [3]:
my_dict = {'key': 5}

The item key was added with value 5. We can access it like this:

In [4]:
my_dict['key']
Out[4]:
5

This is useful because if we have something like a list as the value, we can simply add the index to the end of the call to the dictionary. For example:

In [5]:
my_dict['array'] = [34, 634, 74, 7345]
my_dict['array'][3]
Out[5]:
7345

And now that you see the "double brackets" above, i.e., [ ][ ], you can see where the notation starts to get a little more complicated. Here's a fun nested example:

In [6]:
shell = ['chick']
shell = {'shell': shell}
shell = {'shell': shell}
shell = {'shell': shell}
nest = {'egg': shell}
nest['egg']['shell']['shell']['shell'][0]
Out[6]:
'chick'

Don't worry about that too much...as long as you keep dictionaries and their syntax in mind, it becomes easier to "read" the complicated pandas syntax.
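
With the dictionary syntax in mind, the line shown at the start of this part can be read in three steps. The sketch below is only an illustration: the name net_data and the column capacity come from that line, but the values and the extra column link are made up here.

```python
import pandas as pd

# Hypothetical data: only the names net_data and 'capacity' come from the notebook
net_data = pd.DataFrame({'link': ['a', 'b', 'c', 'd'],
                         'capacity': [120.0, -5.0, 0.0, 40.0]})

# Step 1: the comparison returns a boolean Series, one True/False per row
mask = net_data['capacity'] <= 0
print(mask.tolist())

# Step 2: .loc[rows, column] selects the 'capacity' entries where mask is True
print(net_data.loc[mask, 'capacity'].tolist())

# Step 3: assigning to that selection overwrites only those entries
net_data.loc[mask, 'capacity'] = 0
print(net_data['capacity'].tolist())
```

In other words, every capacity that is zero or negative is set to 0, and all other rows are left untouched.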

Now let's go through a few simple tasks that will illustrate what a DataFrame is (when constructed from a dictionary), and some of its fundamental methods and characteristics.

Task 1.1:

Run the cell below and use the built-in function type to check what kind of object was created.

In [7]:
new_dict = {'names': ['Gauss', 'Newton', 'Lagrange', 'Euler'],
            'birth year': [1777, 1643, 1736, 1707]}

# YOUR_CODE_HERE

# SOLUTION
type(new_dict)
Out[7]:
dict

Task 1.2:

Run the cell below and use the built-in function type to check what kind of object was created.

In [8]:
df = pd.DataFrame(new_dict)

# YOUR_CODE_HERE

# SOLUTION
type(df)
Out[8]:
pandas.core.frame.DataFrame
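
To connect the two objects, here is a minimal sketch comparing how the same item is reached in the dictionary and in the DataFrame built from it. The variable names from_dict, from_df and from_loc are introduced only for this illustration.

```python
import pandas as pd

new_dict = {'names': ['Gauss', 'Newton', 'Lagrange', 'Euler'],
            'birth year': [1777, 1643, 1736, 1707]}
df = pd.DataFrame(new_dict)

# Dictionary: key first, then the position in the list
from_dict = new_dict['names'][1]

# DataFrame: column name first, then the row; .iloc selects by position
from_df = df['names'].iloc[1]

# .loc takes the row label and the column name in one call
from_loc = df.loc[1, 'names']

print(from_dict, from_df, from_loc)
```

All three expressions return the same name, which shows that a column name plays the role of a dictionary key.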

Task 1.3:

Read the code below and try to predict the output before you run it. Then run the cell and confirm your guess. In the second cell, use the built-in function type to check what kind of object was created.

In [9]:
guess = df.loc[df['birth year'] <= 1700, 'names']
print(guess)
1    Newton
Name: names, dtype: object
In [10]:
# YOUR_CODE_HERE

# SOLUTION
type(guess)
Out[10]:
pandas.core.series.Series

Note that this is a Series data type, which is part of the pandas package (you can read about it here). If you need the data stored in the Series, use the attribute values, which returns the data as a NumPy ndarray that you can index like any other array. The example below shows that the selected names form a Series, and that its values attribute has type ndarray.

In [11]:
print(type(df.loc[df['birth year'] <= 1700, 'names']))
print(type(df.loc[df['birth year'] <= 1700, 'names'].values))
print('The value in the series is an ndarray with first item:',
      df.loc[df['birth year'] <= 1700, 'names'].values[0])
<class 'pandas.core.series.Series'>
<class 'numpy.ndarray'>
The value in the series is an ndarray with first item: Newton
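
As a small sketch of other ways to get data out of a Series, the cell below uses the pandas methods to_numpy, iloc and tolist on the same selection. The variable names are chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'names': ['Gauss', 'Newton', 'Lagrange', 'Euler'],
                   'birth year': [1777, 1643, 1736, 1707]})
selection = df.loc[df['birth year'] <= 1700, 'names']

# to_numpy() returns the data as an ndarray, like the attribute values
as_array = selection.to_numpy()

# The only row label in this selection is 1, so use .iloc to ask for
# the first element by position rather than by label
first_item = selection.iloc[0]

# tolist() converts the data to a plain Python list
as_list = selection.tolist()

print(type(as_array), first_item, as_list)
```

Note that the Series keeps the row labels of the original DataFrame, which is why position and label are not the same thing here.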

Another useful feature of pandas is that you can quickly look at the contents of a DataFrame. The method head shows the first rows, including the column names:

In [12]:
df.head()
Out[12]:
      names  birth year
0     Gauss        1777
1    Newton        1643
2  Lagrange        1736
3     Euler        1707
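
If you only need the column names or the size of the DataFrame, a few attributes give that directly. This is a minimal sketch using the same small DataFrame; head shows five rows by default and accepts the number of rows as an argument.

```python
import pandas as pd

df = pd.DataFrame({'names': ['Gauss', 'Newton', 'Lagrange', 'Euler'],
                   'birth year': [1777, 1643, 1736, 1707]})

# Column names as a plain list
print(df.columns.tolist())

# (number of rows, number of columns)
print(df.shape)

# Only the first two rows
print(df.head(2))
```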

You can also get summary information easily:

In [13]:
df.describe()
Out[13]:
        birth year
count     4.000000
mean   1715.750000
std      56.364143
min    1643.000000
25%    1691.000000
50%    1721.500000
75%    1746.250000
max    1777.000000
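
The summary only contains the numeric column, and each row can also be computed separately. The sketch below recomputes a few of the values with the Series methods mean, std and median; the variable names years and summary are introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'names': ['Gauss', 'Newton', 'Lagrange', 'Euler'],
                   'birth year': [1777, 1643, 1736, 1707]})

summary = df.describe()
years = df['birth year']

# describe() returns a DataFrame, so single entries can be read with .loc
print(summary.loc['mean', 'birth year'], years.mean())

# std is the sample standard deviation (divides by n - 1)
print(summary.loc['std', 'birth year'], years.std())

# The 50% row is the median
print(summary.loc['50%', 'birth year'], years.median())
```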

Finally, it is very easy to read a DataFrame from a *.csv file and to write one to a *.csv file. To read, use the following command (you will apply this in the tasks below):

df = pd.read_csv('dams.csv')

To write, the method is similar; the keyword argument index=False avoids adding a numbered index as an extra column in the csv:

df.to_csv('dams.csv', index=False)
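
To see what index=False changes, here is a minimal sketch that writes a small DataFrame twice and reads both files back. The file names and the temporary folder are hypothetical and are used only so that the example does not touch dams.csv.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'names': ['Gauss', 'Newton'],
                   'birth year': [1777, 1643]})

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file names, used only for this illustration
    path_clean = os.path.join(folder, 'without_index.csv')
    path_index = os.path.join(folder, 'with_index.csv')

    df.to_csv(path_clean, index=False)  # index is not written
    df.to_csv(path_index)               # index is written as the first column

    clean = pd.read_csv(path_clean)
    extra = pd.read_csv(path_index)

print(clean.columns.tolist())
print(extra.columns.tolist())
```

The file written without index=False comes back with an additional unnamed first column that holds the old index.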

Task 2: Evaluate and process the data

For this assignment we will use a small file, dams.csv, that is available in the repository for this PA.

Task 2.1:

Import the dataset as a DataFrame, then explore it and learn about its contents (use the methods presented above; you can also look inside the csv file).

In [14]:
# df = YOUR_CODE_HERE
# YOUR_CODE_HERE.head()

# SOLUTION
df = pd.read_csv('dams.csv')
df.head()
Out[14]:
          Name  Year  Volume (1e6 m^3)  Height (m)        Type
0      Tarbela  1976             153.0         143   rock fill
1    Fort Peck  1940              96.0          96  earth fill
2      Ataturk  1990              84.5         166   rock fill
3  Houtribdijk  1968              78.0          13   rock fill
4         Oahe  1963              70.3          75   rock fill

Solution:

We can see that this dataset has some information about dams, including the name, year constructed, volume, height and type. They look pretty big! The file actually contains the 5 largest dams by volume and the 5 largest by height (10 dams total), as listed on the Wikipedia page here.

Task 2.2:

Using the example above, find the dams in the DataFrame that are of type earth fill.

In [15]:
# names_of_earth_dams = YOUR_CODE_HERE
# print('The earth fill dams are:', names_of_earth_dams)

# SOLUTION
names_of_earth_dams = df.loc[df['Type'] == 'earth fill', 'Name'].values[:]
print('The earth fill dams are:', names_of_earth_dams)
The earth fill dams are: ['Fort Peck' 'Nurek' 'Kolnbrein' 'WAC Bennett']

Hint: the answer should be: ['Fort Peck' 'Nurek' 'Kolnbrein' 'WAC Bennett']

Task 2.3:

Create a new dataframe that only includes the earth fill dams. Save it as a new csv file called earth_dams.csv.

Hint: you only need to remove a small thing from the code for your answer to the task above.

In [16]:
# df_earth = YOUR_CODE_HERE
# df_earth.YOUR_CODE_HERE

# SOLUTION
df_earth = df.loc[df['Type'] == 'earth fill']
df_earth.to_csv('earth_dams.csv', index=False)

Task 2.4:

Check the contents of the new csv file to make sure you created it correctly.
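
Besides opening the file in a text editor, one possible check is to read it back with pandas. The sketch below is self-contained, so it rebuilds a DataFrame from only the five rows shown by df.head() above (the full dams.csv has more rows) and writes to a temporary folder; in the notebook you would simply read earth_dams.csv. The variable name check is introduced only for this illustration.

```python
import os
import tempfile

import pandas as pd

# Only the five rows displayed by df.head() above; the full file has more
df = pd.DataFrame({
    'Name': ['Tarbela', 'Fort Peck', 'Ataturk', 'Houtribdijk', 'Oahe'],
    'Year': [1976, 1940, 1990, 1968, 1963],
    'Volume (1e6 m^3)': [153.0, 96.0, 84.5, 78.0, 70.3],
    'Height (m)': [143, 96, 166, 13, 75],
    'Type': ['rock fill', 'earth fill', 'rock fill', 'rock fill', 'rock fill'],
})
df_earth = df.loc[df['Type'] == 'earth fill']

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'earth_dams.csv')
    df_earth.to_csv(path, index=False)
    check = pd.read_csv(path)

# Every row that was read back should be an earth fill dam
print((check['Type'] == 'earth fill').all())

# The columns and the number of rows should be unchanged
print(check.columns.tolist() == df.columns.tolist())
print(len(check) == len(df_earth))
```

With the full dataset, the names in the file should match the four earth fill dams found in Task 2.2.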

End of notebook.

© Copyright 2024 MUDE TU Delft. This work is licensed under a CC BY 4.0 License.