PA 2.5: Data Framework
CEGM1000 MUDE: Week 2.5. Due: complete this PA prior to class on Friday, Dec 13, 2024.
Overview of Assignment
This assignment quickly introduces you to the package pandas. We only use a few small features here, to help you get familiar with it before using it more in the coming weeks. The primary purpose is to easily load data from csv files and quickly process the contents. This is accomplished with a new data type unique to pandas: a DataFrame. It also makes it very easy to export data to a *.csv file.
If you want to learn more about pandas after finishing this assignment, the Getting Started page is a great resource.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Part 1: Introduction to pandas
Pandas dataframes are considered by some to be difficult to use. For example, here is a line of code from one of our notebooks this week. Can you understand what it is doing?
net_data.loc[net_data['capacity'] <= 0, 'capacity'] = 0
One of the reasons for this is that the primary pandas data type, a DataFrame object, uses a dictionary-like syntax to access and store elements. For example, remember that a dictionary is defined using curly braces.
my_dict = {}
type(my_dict)
Also remember that you can add items as a key-value pair:
my_dict = {'key': 5}
The item key was added with value 5. We can access it like this:
my_dict['key']
This is useful because if we have something like a list as the value, we can simply add the index to the end of the call to the dictionary. For example:
my_dict['array'] = [34, 634, 74, 7345]
my_dict['array'][3]
And now that you see the "double brackets" above, i.e., [ ][ ], you can see where the notation starts to get a little more complicated. Here's a fun nested example:
shell = ['chick']
shell = {'shell': shell}
shell = {'shell': shell}
shell = {'shell': shell}
nest = {'egg': shell}
nest['egg']['shell']['shell']['shell'][0]
Don't worry about that too much...as long as you keep dictionaries and their syntax in mind, it becomes easier to "read" the complicated pandas syntax.
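With the dictionary analogy in mind, the line of code from the start of this part can be read piece by piece. The sketch below uses a small made-up net_data; the real one from the notebook is not shown here, so the column link and all values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for net_data; the values are invented for illustration.
net_data = pd.DataFrame({'link': ['a', 'b', 'c', 'd'],
                         'capacity': [120.0, -5.0, 0.0, 40.0]})

# Step 1: a column is accessed by its key, much like a dictionary value.
capacity = net_data['capacity']

# Step 2: comparing the column gives one True/False per row (a boolean mask).
mask = capacity <= 0

# Step 3: .loc[rows, column] selects the masked rows of one column,
# and the assignment overwrites only those entries.
net_data.loc[mask, 'capacity'] = 0

print(net_data)
```

Written on one line, steps 1 to 3 collapse into the original statement: every capacity that is zero or negative is set to 0, and all other rows are left unchanged.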
Now let's go through a few simple tasks that will illustrate what a DataFrame is (when constructed from a dictionary), and some of its fundamental methods and characteristics.
Task 1.1:
Run the cell below and check what kind of object was created using the built-in function type.
new_dict = {'names': ['Gauss', 'Newton', 'Lagrange', 'Euler'],
'birth year': [1777, 1643, 1736, 1707]}
YOUR_CODE_HERE
Task 1.2:
Run the cell below and check what kind of object was created using the built-in function type.
df = pd.DataFrame(new_dict)
YOUR_CODE_HERE
Task 1.3:
Read the code below and try to predict what the answer should be before you run it and view the output. Then run the cell, confirm your guess and, in the second cell, check what kind of object was created using the built-in function type.
guess = df.loc[df['birth year'] <= 1700, 'names']
print(guess)
YOUR_CODE_HERE
Note that this is a Series data type, which is part of the pandas package (you can read about it here). If you need to use the value that is stored in the Series, you can use the attribute values and work with the result as an array of the data in the Series; the example below shows that the names column selected from the DataFrame is a Series whose data has type ndarray.
print(type(df.loc[df['birth year'] <= 1700, 'names']))
print(type(df.loc[df['birth year'] <= 1700, 'names'].values))
print('The value in the series is an ndarray with first item:',
df.loc[df['birth year'] <= 1700, 'names'].values[0])
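As a small sketch of the same idea on a numeric column, the example below rebuilds the DataFrame from Task 1.2 and pulls a single number out of a Series in two ways. The variable names selected, as_array, first_year and same_year are only for illustration, and .iloc[0] is an alternative pandas accessor that is not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'names': ['Gauss', 'Newton', 'Lagrange', 'Euler'],
                   'birth year': [1777, 1643, 1736, 1707]})

# Select the birth year (instead of the name) of everyone born by 1700.
selected = df.loc[df['birth year'] <= 1700, 'birth year']

# .values exposes the numeric data of the Series as a NumPy array,
# which can then be indexed by position.
as_array = selected.values
first_year = as_array[0]

# .iloc[0] picks the first entry by position directly from the Series.
same_year = selected.iloc[0]

print(type(selected).__name__, type(as_array).__name__, first_year, same_year)
```

Both routes give the same number; which one to use is mostly a matter of whether the next step expects a NumPy array or a single value.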
Another useful feature of pandas is the ability to quickly inspect the contents of a DataFrame. The method head shows the first few rows, which also lets you see which columns are present:
df.head()
You can also get summary information easily:
df.describe()
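For a concrete picture of what these quick-look tools return, here is a minimal sketch using the small DataFrame from Part 1. The attributes columns and shape are standard pandas attributes that are not otherwise introduced in this assignment, and the name summary is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'names': ['Gauss', 'Newton', 'Lagrange', 'Euler'],
                   'birth year': [1777, 1643, 1736, 1707]})

print(df.head(2))         # first 2 rows; without an argument head() shows 5
print(list(df.columns))   # column labels
print(df.shape)           # (number of rows, number of columns)

# By default describe() summarizes only the numeric columns,
# so the 'names' column does not appear in the result.
summary = df.describe()
print(summary.loc['mean', 'birth year'])
```

The result of describe is itself a DataFrame, so individual statistics can be looked up with the same .loc syntax used earlier.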
Finally, it is also very easy to read and write dataframes to a *.csv file, which you can do using the following commands (you will apply this in the tasks below):
df = pd.read_csv('dams.csv')
To write, the method is similar; the keyword argument index=False avoids adding a numbered index as an extra column in the csv:
df.to_csv('dams.csv', index=False)
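To see the effect of index=False, the sketch below writes a small DataFrame to a temporary folder and reads it back, once without and once with the index. The file name example.csv and the use of tempfile are assumptions made so that the example does not touch dams.csv.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'names': ['Gauss', 'Newton', 'Lagrange', 'Euler'],
                   'birth year': [1777, 1643, 1736, 1707]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'example.csv')  # hypothetical file name

    # Write without the index, then read the file back.
    df.to_csv(path, index=False)
    without_index = pd.read_csv(path)

    # Write with the default settings: the index is stored as a column too.
    df.to_csv(path)
    with_index = pd.read_csv(path)

print(list(without_index.columns))
print(list(with_index.columns))
```

The second list has one extra column at the front, which holds the old row numbers; this is the column that index=False prevents.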
Task 2: Evaluate and process the data
For this assignment we will use a small file, dams.csv, that is available in the repository for this PA.
Task 2.1:
Import the dataset as a DataFrame, then explore it and learn about its contents (use the methods presented above; you can also look inside the csv file).
df = YOUR_CODE_HERE
YOUR_CODE_HERE.head()
Task 2.2:
Using the example above, find the dams in the DataFrame that are of type earth fill.
names_of_earth_dams = YOUR_CODE_HERE
print('The earth fill dams are:', names_of_earth_dams)
Hint: the answer should be: ['Fort Peck' 'Nurek' 'Kolnbrein' 'WAC Bennett']
Task 2.3:
Create a new dataframe that only includes the earth fill dams. Save it as a new csv file called earth_dams.csv.
Hint: you only need to remove a small thing from the code for your answer to the task above.
df_earth = YOUR_CODE_HERE
df_earth.YOUR_CODE_HERE
Task 2.4:
Check the contents of the new csv file to make sure you created it correctly.
End of notebook.