PA 2.5: Data Framework

CEGM1000 MUDE: Week 2.5. Due: complete this PA prior to class on Friday, Dec 13, 2024.

Overview of Assignment

This assignment quickly introduces you to the package pandas. We only use a few small features here, to help you get familiar with it before using it more in the coming weeks. Its primary purpose here is to load data from *.csv files and quickly process the contents. This is accomplished with a data type unique to pandas: a DataFrame. pandas also makes it very easy to export data to a *.csv file.

If you want to learn more about pandas after finishing this assignment, the Getting Started page is a great resource.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Part 1: Introduction to pandas

Pandas dataframes are considered by some to be difficult to use. For example, here is a line of code from one of our notebooks this week. Can you understand what it is doing?

net_data.loc[net_data['capacity'] <= 0, 'capacity'] = 0

One of the reasons for this is that the primary pandas data type, a DataFrame object, uses a dictionary-like syntax to access and store elements. For example, remember that a dictionary is defined using curly braces.

In [ ]:
my_dict = {}
type(my_dict)

Also remember that a dictionary stores items as key-value pairs, which you can specify when defining it:

In [ ]:
my_dict = {'key': 5}

The item key was added with value 5. We can access it like this:

In [ ]:
my_dict['key']

This is useful because if we have something like a list as the value, we can simply add the index to the end of the call to the dictionary. For example:

In [ ]:
my_dict['array'] = [34, 634, 74, 7345]
my_dict['array'][3]

And now that you see the "double brackets" above, i.e., [ ][ ], you can see where the notation starts to get a little more complicated. Here's a fun nested example:

In [ ]:
shell = ['chick']
shell = {'shell': shell}
shell = {'shell': shell}
shell = {'shell': shell}
nest = {'egg': shell}
nest['egg']['shell']['shell']['shell'][0]

Don't worry about that too much...as long as you keep dictionaries and their syntax in mind, it becomes easier to "read" the complicated pandas syntax.
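
As a rough illustration of reading pandas syntax with dictionaries in mind, the sketch below rebuilds the line shown at the start of this part step by step. The contents of net_data are made up for this example (including the hypothetical column link); only the column name capacity comes from the original line.

```python
import pandas as pd

# Hypothetical data: only the column name 'capacity' comes from the line above
net_data = pd.DataFrame({'link': ['a', 'b', 'c', 'd'],
                         'capacity': [10.0, -2.0, 0.0, 5.0]})

# Step 1: dictionary-like access returns one column (a Series)
capacity = net_data['capacity']

# Step 2: comparing the column gives a True/False value for each row
mask = capacity <= 0

# Step 3: .loc[rows, column] selects the 'capacity' entries in the rows
# where mask is True and overwrites them with 0
net_data.loc[mask, 'capacity'] = 0

print(net_data)
```

Written on one line, steps 1 to 3 collapse into the original expression: the condition inside .loc picks the rows, and the second argument picks the column.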

Now let's go through a few simple tasks that will illustrate what a DataFrame is (when constructed from a dictionary), and some of its fundamental methods and characteristics.

Task 1.1:

Run the cell below and check what kind of object was created using the built-in function type.

In [ ]:
new_dict = {'names': ['Gauss', 'Newton', 'Lagrange', 'Euler'],
            'birth year': [1777, 1643, 1736, 1707]}

YOUR_CODE_HERE

Task 1.2:

Run the cell below and check what kind of object was created using the built-in function type.

In [ ]:
df = pd.DataFrame(new_dict)

YOUR_CODE_HERE

Task 1.3:

Read the code below and try to predict what the answer should be before you run it and view the output. Then run the cell, confirm your guess, and in the second cell check what kind of object was created using the built-in function type.

In [ ]:
guess = df.loc[df['birth year'] <= 1700, 'names']
print(guess)

In [ ]:
YOUR_CODE_HERE

Note that this is a Series data type, which is part of the pandas package (you can read about it here). If you need to use the values stored in the Series, you can use the attribute values, which returns the data as a NumPy ndarray; the example below shows that the selected names form a Series whose data has type ndarray.

In [ ]:
print(type(df.loc[df['birth year'] <= 1700, 'names']))
print(type(df.loc[df['birth year'] <= 1700, 'names'].values))
print('The value in the series is an ndarray with first item:',
      df.loc[df['birth year'] <= 1700, 'names'].values[0])
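
For comparison, here is a small sketch of a few ways to get the data out of a Series. The Series s and its contents are hypothetical and unrelated to the DataFrame used in the tasks.

```python
import pandas as pd

# Hypothetical Series, unrelated to the DataFrame used in the tasks
s = pd.Series([1.5, 2.5, 4.0], name='height')

as_array = s.values        # attribute: NumPy ndarray holding the data
also_array = s.to_numpy()  # method form that also returns an ndarray
as_list = s.tolist()       # plain Python list

print(type(as_array), type(also_array), type(as_list))
print(as_array[0], as_list[-1])
```

Once the data is an ndarray or a list, it can be indexed with square brackets in the same way as shown earlier for the list stored in a dictionary.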

Another useful feature of pandas is the ability to quickly look at the contents of the DataFrame. The method head shows the column names and the first rows (five by default):

In [ ]:
df.head()

You can also get summary information easily:

In [ ]:
df.describe()
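
A few other attributes can help when exploring a DataFrame. The sketch below uses a small made-up DataFrame called demo (not the assignment data) and is only meant to show what each call returns.

```python
import pandas as pd

# Hypothetical example data, not the assignment dataset
demo = pd.DataFrame({'city': ['Delft', 'Leiden', 'Gouda'],
                     'population': [100, 120, 75]})

print(demo.columns.tolist())  # column names as a list
print(demo.shape)             # (number of rows, number of columns)
print(demo.dtypes)            # data type of each column

# describe() summarizes numeric columns by default,
# so 'city' is left out of this table
print(demo.describe())
```

Note that describe only reports on numeric columns unless told otherwise, so a column of names will not appear in its output.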

Finally, it is also very easy to read and write dataframes to a *.csv file, which you can do using the following commands (you will apply this in the tasks below):

df = pd.read_csv('dams.csv')

To write, the method is similar; the keyword argument index=False avoids adding a numbered index as an extra column in the csv:

df.to_csv('dams.csv', index=False)
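
To see the effect of index=False, the sketch below writes a small made-up DataFrame to a temporary folder and reads it back, once without and once with the index. The names demo and demo.csv are hypothetical; a temporary folder is used here so that no existing file is overwritten.

```python
import os
import tempfile

import pandas as pd

# Hypothetical example data, not the assignment dataset
demo = pd.DataFrame({'name': ['A', 'B'], 'height': [1.5, 2.5]})

# Write to a temporary folder so no existing file is overwritten
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo.csv')

    demo.to_csv(path, index=False)       # no index column in the file
    demo_again = pd.read_csv(path)       # same columns as the original

    demo.to_csv(path)                    # default: the index is written too
    demo_with_index = pd.read_csv(path)  # gains an extra unnamed column

print(demo_again.columns.tolist())
print(demo_with_index.columns.tolist())
```

Without index=False, the row index is written as a first column with no header, which read_csv then loads as a column called 'Unnamed: 0'.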

Part 2: Evaluate and process the data

For this assignment we will use a small file, dams.csv, that is available in the repository for this PA.

Task 2.1:

Import the dataset as a DataFrame, then explore it and learn about its contents (use the methods presented above; you can also look inside the csv file).

In [ ]:
df = YOUR_CODE_HERE
YOUR_CODE_HERE.head()

Solution:

We can see that this dataset has some information about dams, including the name, year constructed, volume and height. They look pretty big! The dataset actually contains the 5 largest dams by either volume or height (10 dams total), listed on the Wikipedia page here.

Task 2.2:

Using the example above, find the dams in the DataFrame that are of type earth fill.

In [ ]:
names_of_earth_dams = YOUR_CODE_HERE
print('The earth fill dams are:', names_of_earth_dams)

Hint: the answer should be: ['Fort Peck' 'Nurek' 'Kolnbrein' 'WAC Bennett']

Task 2.3:

Create a new dataframe that only includes the earth fill dams. Save it as a new csv file called earth_dams.csv.

Hint: you only need to remove a small thing from the code for your answer to the task above.

In [ ]:
df_earth = YOUR_CODE_HERE
df_earth.YOUR_CODE_HERE

Task 2.4:

Check the contents of the new csv file to make sure you created it correctly.

End of notebook.

© Copyright 2024 MUDE TU Delft. This work is licensed under a CC BY 4.0 License.