PA 1.3: Data Cleaning and Boosting Productivity¶

CEGM1000 MUDE: Week 1.3. Due: before Friday, Sep 20, 2023.

This notebook consists of two parts:

  1. Data Cleaning, with Tasks 1.1 to 2.6
  2. Boosting Productivity, with Tasks 3.1 to 3.10

Remember that there is a survey that must be completed to pass this PA.

Here is a link to the survey.

Data cleaning¶

Data files often contain unexpected or odd values. If they are not removed properly, they can cause problems in our analysis. For example, NaNs, infinite values, or just really large outliers may cause our code to behave in unexpected ways. It is good practice to get in the habit of visualizing and processing datasets before you start using them! This programming assignment will illustrate this process.
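
To see why this matters, here is a minimal sketch with a made-up array (the values and variable names are illustrative and are not taken from data_2.csv). It shows how a single NaN or error code can spoil something as simple as a mean:

```python
import numpy as np

# Made-up readings: one missing value (NaN) and one sensor error code (999)
readings = np.array([20.1, 19.8, np.nan, 20.4, 999.0])

# A single NaN turns the plain mean into NaN
plain_mean = readings.mean()

# nanmean skips the NaN, but the 999 error code still distorts the result
mean_ignoring_nan = np.nanmean(readings)

# Keeping only finite and plausible values gives a sensible mean
clean = readings[np.isfinite(readings) & (readings < 100)]
clean_mean = clean.mean()

print(plain_mean, mean_ignoring_nan, clean_mean)
```

The threshold of 100 is an arbitrary choice for this toy example; what counts as plausible depends on the dataset.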

This part of the assignment covers two topics:

  1. Finding "odd" values in an array and removing them
  2. Using plots to identify other "oddities" that can be removed

We will need one csv file, data_2.csv, to complete this assignment.

In [ ]:
# use the mude-base environment

import numpy as np
import matplotlib.pyplot as plt

Task 1: Importing and Cleaning the array¶

In a previous week we looked at how to read in data from a csv, plot a nice graph and even find the $R^2$ of the data. This week an eager botany student, Johnathan, has asked us to help him analyze some data: 1000 measurements have just been completed along the 100 m length of a greenhouse and are ready to use in data_2.csv. Johnathan happens to have a lot of free time but not that much experience taking measurements. Thus, there is some noise in the data, and some of the data are problematic because of errors in the measurement device. Let's help him out!

Task 1.1: Import the data as 2 numpy arrays: distance and temperature.

In [ ]:
distance, temperature = np.genfromtxt("data_2.csv", skip_header = 1, delimiter=",", unpack=True)
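
If you are curious what the options in this call do, the sketch below reads a tiny made-up csv from a string instead of a file (io.StringIO is used as a stand-in, and the contents are an assumption, not the real data_2.csv). With unpack=True each column comes back as its own array, and an empty field is read as NaN:

```python
import io
import numpy as np

# Hypothetical two-column file; the second data row has no temperature value
csv_text = "distance,temperature\n0.0,21.5\n0.1,\n0.2,22.0\n"

# skip_header=1 drops the column names, unpack=True returns one array per column
d, t = np.genfromtxt(io.StringIO(csv_text), skip_header=1, delimiter=",", unpack=True)

print(d)  # first column
print(t)  # second column; the missing entry shows up as nan
```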

Task 1.2: In the code cell below, evaluate the size of the array.

In [ ]:
temperature.size

Task 1.3: Check for NaNs by defining a variable boolean using the numpy function isnan, which returns a boolean array (True where the value is a NaN, False where it is not). The code block below will also help you inspect the results.

In [ ]:
boolean = np.isnan(temperature)

print("The first 10 values are:", boolean[0:10])
print(f"There are {boolean.sum()} NaNs in array temperature")

Let's index the array using the boolean array we just found to eliminate the NaNs. We can use the ~ operator, which flips every True to False and vice versa: we want to keep the values where np.isnan gives False as an answer.

In [ ]:
temperature = temperature[~boolean]
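
A small sketch on a made-up array (the names values, is_nan and keep are only for illustration) may help to see what ~ and boolean indexing do:

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0, np.nan, 5.0])  # made-up example data

is_nan = np.isnan(values)  # True where the entry is NaN
keep = ~is_nan             # element-wise NOT: True where the entry is a number

print(is_nan)
print(keep)
print(values[keep])        # only the entries where keep is True survive
```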

Task 1.4: Check the size again, and make sure you recognize that we over-wrote the variable `temperature`. This will have an impact on other cells where you use this variable, for example, if you re-run the cell below Task 1.3, the result will be different, because the array contents have changed!

How big is the array now? How many values were removed?

In [ ]:
temperature.size

But now we have a problem: our distance array still has the entries that correspond to the bad entries in temperature. We can see that the dimensions of the arrays no longer match:

In [ ]:
distance.size==temperature.size

Also, we don't know what the indices of the removed values were, since we over-wrote temperature! Luckily we have our boolean array, which records the positions of the NaNs, and we can use it to update our distance array as well.

Task 1.5: Use the boolean array from Task 1.3 to remove the matching entries in the distance array, then check that it has the same length as temperature.

In [ ]:
distance = distance[~boolean]
distance.size==temperature.size
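
One way to avoid this kind of mismatch is to build the mask once and apply it to both arrays in a single statement. This is only a sketch with made-up arrays x and y, not the required solution:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])          # made-up positions
y = np.array([10.0, np.nan, 12.0, np.nan])  # made-up measurements

valid = ~np.isnan(y)        # build the mask before anything is overwritten
x, y = x[valid], y[valid]   # filter both arrays together

print(x, y, x.size == y.size)
```

Because both arrays are reassigned in the same line, they cannot get out of sync halfway through.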

Task 2: Visualizing the Dataset¶

Now we can plot the temperature with distance to see what it looks like.

In [ ]:
plt.plot(distance, temperature, "ok", label = 'Temperature')
plt.title("Super duper greenhouse")
plt.xlabel('Distance')
plt.ylabel('Temperature')
plt.show()

It looks like there are some outliers in the dataset still! Let's investigate:

In [ ]:
print(temperature.min())
print(temperature.max())

The values are suspicious since they are +/-999. This is a common error code with some sensors, so we can assume that they can be removed from the dataset. We can easily remove these erroneous values of temperature, but this time we will use a different method than before. An exclamation mark before an equals sign, !=, denotes "not equal to." We can use this comparison operator to eliminate the values directly in one line. For example:

array_1 = array_1[array_2!=-999]

Task 2.1: Use the "not equal to" operator to re-define temperature and distance such that all the temperatures with -999 are removed (don't do the +999 values yet!). Keep in mind that the order of the arrays matters: if you reassign temperature, you won't have the information any more to fix distance!!!

In [ ]:
distance = distance[temperature!=-999]
temperature = temperature[temperature!=-999]
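
The sketch below uses made-up arrays (dist and temp are illustrative names) to show what goes wrong when the order is reversed: the comparison is then built from the already shortened array, so it no longer fits the other array and numpy raises an IndexError.

```python
import numpy as np

dist = np.array([0.0, 1.0, 2.0, 3.0])          # made-up distances
temp = np.array([20.0, -999.0, 21.0, -999.0])  # made-up temperatures

# Wrong order: temp is filtered first, so only 2 values are left
temp_short = temp[temp != -999]
try:
    dist[temp_short != -999]  # mask has length 2, dist has length 4
    failed = False
except IndexError as err:
    failed = True
    print("IndexError:", err)

# Right order: filter dist while temp still has its original length
dist_ok = dist[temp != -999]
temp_ok = temp[temp != -999]
print(dist_ok, temp_ok)
```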

Are the arrays the same size still? If you did it correctly, they should be.

In [ ]:
print(distance.size==temperature.size)
temperature.size

For the +999 values we will use yet another method, a combination of the previous two.

Task 2.2: Use the not equal to operator and a boolean array to define an array "mask" that will help you remove the data corresponding to temperatures with +999.

This time the boolean array is built from temperature and stored in a variable first, instead of being written inline.

In [ ]:
mask = temperature!=999
distance = distance[mask]
temperature = temperature[mask]
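
As an aside, conditions can also be combined into a single mask with the & operator, so that several error codes are removed in one step. The sketch below uses made-up arrays and is an alternative, not the approach asked for in the tasks:

```python
import numpy as np

dist = np.array([0.0, 1.0, 2.0, 3.0, 4.0])          # made-up distances
temp = np.array([20.0, -999.0, 21.0, 999.0, 22.0])  # made-up temperatures

# Parentheses are needed because & binds more tightly than !=
mask = (temp != -999) & (temp != 999)

# np.isin gives the same result when there is a list of codes to exclude
mask_alt = ~np.isin(temp, [-999, 999])

dist, temp = dist[mask], temp[mask]
print(dist, temp)
```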

The array is named "mask" because this process utilizes masked arrays...you can read more about it here.

Anyway, now that we have removed the annoying +/-999 values, we can finally start to see our dataset more clearly:

In [ ]:
plt.plot(distance, temperature, "ok", label = 'Temperature')
plt.title("Super duper greenhouse")
plt.xlabel('Distance')
plt.ylabel('Temperature')
plt.show()

Looks good! But wait---there also appear to be some values in the array that are not physically possible! We know for sure that there was nothing cold in the greenhouse during the measurements; also it's very likely that a "0" value could have come from an error in the sensor.

See if you can apply the numpy function nonzero to remove zeros from the array. Hint: its result can be used to index an array, just like the boolean array we got from isnan above.

Task 2.3: Use nonzero to remove the zeros.

In [ ]:
distance = distance[np.nonzero(temperature)]
temperature = temperature[np.nonzero(temperature)]
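
Note that np.nonzero does not return a boolean array: it returns a tuple containing an array of the integer indices where the values are not zero. Indexing with it gives the same result as a boolean mask, as this sketch with a made-up array shows:

```python
import numpy as np

temp = np.array([20.0, 0.0, 21.0, 0.0, 22.0])  # made-up temperatures

idx = np.nonzero(temp)  # tuple holding one array of integer indices
mask = temp != 0        # boolean array with the same length as temp

print(idx)
print(mask)
print(temp[idx], temp[mask])  # both select the non-zero values
```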

It also seems quite obvious that the values above 50 degrees are not physically possible (or perhaps Johnathan was standing near an oven?!). In any case, they aren't consistent with the rest of the data, so we should remove them.

Task 2.4: Use an inequality, < to keep all values less than 50.

In [ ]:
distance = distance[temperature<50]
temperature = temperature[temperature<50]

Now let's take another look at our data:

In [ ]:
plt.plot(distance, temperature, "ok", label = 'Temperature')
plt.title("Super duper greenhouse")
plt.xlabel('Distance')
plt.ylabel('Temperature')
plt.show()

Let's pretend there is a systematic error in our measurement device because it was not properly calibrated. As a result, all observations below 15 degrees need to be corrected by multiplying the measurement by 1.5. Numpy actually makes it very easy to change the contents of an array conditionally with the where function!

Task 2.5: Play with the cell below to understand what the where method does (i.e., replacement)---it's very useful to know about!

In [ ]:
temperature = np.where(temperature>15, temperature, temperature*1.5)

Remember you can investigate the where function in a notebook easily by executing np.where?. Try it and read the documentation!
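
For a quick feel of the behaviour, here is a minimal sketch on a made-up array (the names temp, corrected and twice are illustrative). np.where(condition, a, b) takes the value from a where the condition is True and from b where it is False. The sketch also shows that applying the same correction a second time can change values again:

```python
import numpy as np

temp = np.array([10.0, 14.0, 16.0, 20.0])  # made-up temperatures

# Keep values above 15 as they are, multiply the others by 1.5
corrected = np.where(temp > 15, temp, temp * 1.5)
print(corrected)

# Applying the correction again: 15.0 is not above 15, so it is scaled once more
twice = np.where(corrected > 15, corrected, corrected * 1.5)
print(twice)
```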

Let's plot the array again to see what happened (you'll have to compare the two plots carefully to see the difference). Remember that if you re-run the Task 2.5 cell several times, it will over-write temperature each time, so you will probably need to restart the kernel and run the notebook again to reset the values.

In [ ]:
plt.plot(distance, temperature, "ok", label = 'Temperature')
plt.title("Super duper greenhouse")
plt.xlabel('Distance')
plt.ylabel('Temperature')
plt.show()

Now that we are done cleaning the data, let's learn about it.

Task 2.6: Calculate the mean and variance of temperature. Use built-in numpy functions.

In [ ]:
mean_temperature = temperature.mean()
print(f"{mean_temperature = :.3f}")

variance_temperature = temperature.var()
print(f"{variance_temperature = :.3f}")

Boosting productivity¶

Have you ever been frustrated by sharing code with your friends? How nice would it be to work Google Docs style, on the same document simultaneously! Maybe you know Deepnote, which does this in an online interface. But there's a better solution: Visual Studio Live Share!

In this PA you'll install the required extension in VS Code. You'll use this extension on Wednesday in class, and can freely use it in your future career!

Task 3.1: Download and install the Visual Studio Live Share extension from the Visual Studio Marketplace and log in, as explained in the book.

After installing and signing in to Visual Studio Live Share, you'll share a project with yourself to test the collaboration session.

Task 3.2: Use your normal workflow to open a folder

Task 3.3: Start a collaboration session: select Live Share on the status bar, or press Ctrl+Shift+P (Cmd+Shift+P on a Mac) and then select Live Share: Start collaboration session (Share).

Live Share Button

The first time you share, your desktop firewall software might prompt you to allow the Live Share agent to open a port. Opening a port is optional. It enables a secured direct mode to improve performance when the person you're working with is on the same network as you. For more information, see changing the connection mode.

An invitation link will be automatically copied to your clipboard. You'll use this link to interact with yourself in this assignment. If you want to collaborate with others, you can share this link with them so that they can open the project in their browser or in their own VS Code.

You'll also see the Live Share status bar item change to represent the session state.

Task 3.4: Paste the invitation link into your web browser.

A web version of Visual Studio Code will open in your browser. Continue there.

Task 3.5: Log in using the same steps as in Task 3.1.

Task 3.6: Go back to the desktop participant (VS Code on your computer) and try typing a few lines of correct code in the cell below. Do you see the same change happening live in the browser?

In [ ]:
print('hello world')

Task 3.7: Go back to the browser version of VS Code and try running the cell. Note that this requires requesting access from the browser participant. Request that access and approve it in the desktop participant. In the desktop participant you now need to select your Python environment. Does the cell run? Do you see the output in both participants? Make sure the output doesn't show an error!

Task 3.8: Let's try to make sense of what's happening. Where is the code executed? Run the following code cell from the browser participant. On which computer does this collaborative session run?

In [ ]:
import platform
print("Running on a", platform.system(),"machine named:", platform.node())

That's cool, right? Imagine what you could do with this...

Task 3.9: Explore some of the features of Live Share (session chat, following). More information can be found here.

Task 3.10: Stop the collaboration session on the desktop participant: open the Live Share view on the Explorer tab or the VS Live Share tab and select the Stop collaboration session button.

Your web-participant will be notified that the session is over. It won't be able to access the content and any temp files will automatically be cleaned up. Don't forget to save your work on the desktop-participant!

End of notebook.

© Copyright 2024 MUDE TU Delft. This work is licensed under a CC BY 4.0 License.