"We will need one csv file, `data_2.csv`, to complete this assignment."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a74df1db",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
...
...
@@ -80,7 +72,7 @@
"source": [
"### Task 1: Importing and Cleaning the array\n",
"\n",
"In a previous week we looked at how to read in data from a csv, plot a nice graph and even find the $R^2$ of the data. This week an eager botany student, Johnathan, has asked us to help him analyze some data: 1000 measurements have just been completed over the 100m of greenhouse and are ready to use in `data_2.csv`. Johnathan happens to have a lot of free time but not that much experience taking measurements. Thus, there is some noise in the data and some problematic data that are a result of an error in the measurement device. Let's help them out!"
"In a previous week we looked at how to read in data from a csv, plot a nice graph and even find the $R^2$ of the data. This week, an eager botany student, Johnathan, has asked us to help him analyze some data: 1000 measurements have just been completed over the 100m of greenhouse and are ready to use in `data_2.csv`. Jonathan happens to have a lot of free time but not that much experience taking measurements. Thus, there is some noise in the data and some problematic data that are a result of an error in the measurement device. Let's help them out!"
" Check by defining a variable <code>boolean</code> using the numpy method <code>isnan</code>, which returns a boolean vector (False if it is not a NaN, and True if it is a NaN). The code block below will also help you inspect the results.\n",
" Check if there are NaN (not a number) values in the temperature array. You can use the numpy method <code>isnan</code>, which returns a boolean vector (False if it is not a NaN, and True if it is a NaN). Save the result in the variable <code>temperature_is_nan</code>. The code block below will also help you inspect the results.\n",
"</p>\n",
"</div>"
]
...
...
@@ -149,10 +141,10 @@
"metadata": {},
"outputs": [],
"source": [
"boolean = YOUR_CODE_HERE\n",
"temperature_is_nan = YOUR_CODE_HERE\n",
"\n",
"print(\"The first 10 values are:\", boolean[0:10])\n",
"print(f\"There are {boolean.sum()} NaNs in array temperature\")"
"print(\"The first 10 values are:\", temperature_is_nan[0:10])\n",
"print(f\"There are {temperature_is_nan.sum()} NaNs in array temperature\")"
]
},
{
...
...
@@ -160,7 +152,7 @@
"id": "c9f8994b",
"metadata": {},
"source": [
"Let's slice the array using the `boolean` array we just found to eliminate the NaNs. We can use the symbol `~`, which denotes the opposite: we want to keep those where np.isnan gives False as an answer."
"Let's slice the array using the `temperature_is_nan` array we just found to eliminate the NaNs. We can use the symbol `~`, which denotes the opposite: we want to keep those where np.isnan gives False as an answer."
]
},
{
...
...
@@ -170,7 +162,7 @@
"metadata": {},
"outputs": [],
"source": [
"temperature = temperature[~boolean]"
"temperature = temperature[~temperature_is_nan]"
]
},
{
...
...
@@ -195,7 +187,7 @@
"metadata": {},
"outputs": [],
"source": [
"YOUR_CODE_HERE"
"YOUR_CODE_HERE\n"
]
},
{
...
...
@@ -213,7 +205,7 @@
"metadata": {},
"outputs": [],
"source": [
"distance.size==temperature.size"
"distance.size==temperature.size"
]
},
{
...
...
@@ -221,7 +213,7 @@
"id": "8a80b2d6",
"metadata": {},
"source": [
"Also, we don't know what the index of the removed values were, since we over-wrote `temperature`! Luckily we have our `boolean` array, which records the indices with Nans, which we can also use to update our `distance` array."
"Also, we don't know which indices were removed, since we overwrote `temperature`! Luckily we have our `temperature_is_nan` array, which records the positions of the NaNs and which we can also use to update our `distance` array."
"The values are suspcious since they are +/-999...this is a common error code with some sensors, so we can assume that they can be removed from the dataset. We can easily remove these erroneous values of temperature, but this time we will use a different method than before. The explamation mark before an equals sig, `!=`, denotes \"not equal to.\" We can use this as a logic operator to directly eliminate the values in one line. For example:\n",
"The values are suspicious since they are +/-999. This is a common error code with some sensors, so we can assume that they can be removed from the dataset. We can easily remove these erroneous values of temperature, but this time we will use a different method than before. The exclamation mark before an equals sign, `!=`, denotes \"not equal to.\" We can use this as a logical operator to directly eliminate the values in one line. For example:\n",
"```\n",
"array_1 = array_1[array_2!=-999]\n",
"array_1 = array_1[array_2!=-999]\n",
"```"
]
},
...
...
@@ -341,7 +333,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(distance.size==temperature.size)\n",
"print(distance.size==temperature.size)\n",
"temperature.size"
]
},
...
...
@@ -391,7 +383,7 @@
"id": "830e00fd",
"metadata": {},
"source": [
"The array is names \"mask\" because this process utilizes **masked arrays**...you can read more about it [here](https://python.plainenglish.io/numpy-masks-in-python-d8c13509fbc8)."
"The array is named \"mask\" because this process utilizes **masked arrays**...you can read more about it [here](https://python.plainenglish.io/numpy-masks-in-python-d8c13509fbc8)."
"Looks good! But wait---there also appear to be some values in the array that are not physically possible! We know for sure that there was nothing cold in the greenhouse during the measurements; also it's very likely that a \"0\" value could have come from an error in the sensor.\n",
"Looks good! But wait—there also appear to be some values in the array that are not physically possible! We know for sure that there was nothing cold in the greenhouse during the measurements; also it's very likely that a \"0\" value could have come from an error in the sensor.\n",
"\n",
"See if you can apply the `numpy` method `nonzero` to remove zeros from the array. Hint: it works in a very similar way to `isnan`, which we used above."
"Let's pretend there is a systematic error in our measurement device because it was not properly calibrated. It causes all observations below 15 degrees need to be corrected dividing the multiplying the measurement by 1.5. Numpy actually makes it very easy to change the contents of an array conditionally by replacement using the `where` method!"
"Let's pretend that there is a systematic error in our measurement device because it was not calibrated properly. As a result, all observations below 15 degrees need to be corrected by multiplying the measurement by 1.5. Numpy actually makes it very easy to replace the contents of an array based on a condition using the `where` method!"
"Did you ever got frustrated by sharing code with your friends? How nice would it be to do it the Google-Docs style and work on the same document simultaneously! Maybe you know Deepnote, which does this in an online interface. But there's a better solution: [Visual Studio Live Share](https://visualstudio.microsoft.com/services/live-share/)!\n",
"Have you ever gotten frustrated by sharing code with your friends? How nice would it be to do it the Google-Docs style and work on the same document simultaneously! Maybe you know Deepnote, which does this in an online interface. But there's a better solution: [Visual Studio Live Share](https://visualstudio.microsoft.com/services/live-share/)!\n",
"\n",
"In this PA you'll install the required extension in VS Code. You'll use this extension on Wednesday in class, and can freely use it in your future career!"
" Download, install and login in the Visual Studio Live Share Extension from the Visual Studio Marketplace as explained in the <a href=\"https://mude.citg.tudelft.nl/2024/book/external/learn-programming/book/install/ide/vsc/vs_live_share.html\">book</a>\n",
" Download, install and login in the Visual Studio Live Share Extension from the Visual Studio Marketplace as explained in the <a href=\"https://mude.citg.tudelft.nl/2024/book/external/learn-programming/book/install/ide/vsc/vs_live_share.html\">book</a>.\n",
"</p>\n",
"</div>"
]
...
...
@@ -626,7 +618,7 @@
"id": "2751b89a",
"metadata": {},
"source": [
"After installing and signing into Visual Studio Live Share, you'll share a project with yourself to test the collaboration session"
"After installing and signing into Visual Studio Live Share, you'll share a project with yourself to test the collaboration session."
" Use your normal workflow to open the folder where this notebook is situated in VS Code.\n",
"</p>\n",
"</div>"
]
...
...
@@ -661,17 +653,17 @@
},
{
"cell_type": "markdown",
"id": "bffc6617",
"id": "d3913dbd",
"metadata": {},
"source": [
"An invitation link will be automatically copied to your clipboard. You'll use this link to interact with yourself in this assignment. If you want to collaborate with other, you can share this link with other to open up the project in their browser on own VS Code.\n",
"An invitation link will be automatically copied to your clipboard. You'll use this link to interact with yourself in this assignment. If you want to collaborate with others, you can share this link with them to open up the project in their browser or own VS Code.\n",
"\n",
"You'll also see the **Live Share** status bar item change to represent the session state."
" Go back to your browser version of VS code and try running the cell. Note that this requires Requesting access in the browser participant. Request that access and approve it in the desktop participant of VS code. In the desktop participant you now need to select your python environment. Does the cell run? Do you see the output in both participants? Make sure the output doesn't show an error!\n",
" Go back to your browser version of VS code and try running the cell. Note that this requires the browser participant to request access. Request that access and approve it in the desktop participant of VS code. In the desktop participant you now need to select your Python environment. Does the cell run? Do you see the output in both participants? Make sure the output doesn't show an error!\n",
"Your web-participant will be notified that the session is over. It won't be able to access the content and any temp files will automatically be cleaned up. Don't forget to save your work on the desktop-participant!\n",
"Don't forget to save your work! You can do that on either the browser-participant or the desktop-participant!\n",
"</p>\n",
"<p>\n",
"After stopping the collaboration session, your web-participant will be notified that the session is over. It won't be able to access the content and any temp files will automatically be cleaned up.\n",
*[CEGM1000 MUDE](http://mude.citg.tudelft.nl/): Week 1.3. Due: before Friday, Sep 20, 2023.*
%% Cell type:markdown id:f5a4caf7 tags:
This notebook consists of two parts:
1. Data Cleaning, with tasks 1.1 - 2.6
2. Boosting Productivity, with tasks 3.1 - 3.10
**Remember that there is a survey that must be completed to pass this PA.**
[Here is a link to the survey](https://forms.office.com/e/saRwPUyL8d).
%% Cell type:markdown id:0b0a42bb tags:
## Data cleaning
Often we get data in a file that contains unexpected and odd things inside. If not removed in a proper way, they can cause problems in our analysis. For example, NaNs, infinite values, or just really large outliers may cause things in our code to behave in an unexpected way. It is good practice to **get in the habit of visualizing and processing datasets before you start using them!** This programming assignment will illustrate this process.
This assignment covers two topics:
1. Finding "odd" values in an array and removing them
2. Using plots to identify other "oddities" that can be removed
We will need one csv file, `data_2.csv`, to complete this assignment.
%% Cell type:code id:a74df1db tags:
``` python
```
%% Cell type:code id:846ae254 tags:
``` python
# use the mude-base environment
import numpy as np
import matplotlib.pyplot as plt
```
%% Cell type:markdown id:78611f77 tags:
### Task 1: Importing and Cleaning the array
In a previous week we looked at how to read in data from a csv, plot a nice graph and even find the $R^2$ of the data. This week an eager botany student, Johnathan, has asked us to help him analyze some data: 1000 measurements have just been completed over the 100m of greenhouse and are ready to use in `data_2.csv`. Johnathan happens to have a lot of free time but not that much experience taking measurements. Thus, there is some noise in the data and some problematic data that are a result of an error in the measurement device. Let's help them out!
In a previous week we looked at how to read in data from a csv, plot a nice graph and even find the $R^2$ of the data. This week, an eager botany student, Jonathan, has asked us to help him analyze some data: 1000 measurements have just been completed over the 100 m of greenhouse and are ready to use in `data_2.csv`. Jonathan happens to have a lot of free time but not that much experience taking measurements. Thus, there is some noise in the data and some problematic data that are a result of an error in the measurement device. Let's help him out!
Check by defining a variable <code>boolean</code> using the numpy method <code>isnan</code>, which returns a boolean vector (False if it is not a NaN, and True if it is a NaN). The code block below will also help you inspect the results.
Check if there are NaN (not a number) values in the temperature array. You can use the numpy method <code>isnan</code>, which returns a boolean vector (False if it is not a NaN, and True if it is a NaN). Save the result in the variable <code>temperature_is_nan</code>. The code block below will also help you inspect the results.
</p>
</div>
%% Cell type:code id:f830b4ad tags:
``` python
boolean = YOUR_CODE_HERE
temperature_is_nan = YOUR_CODE_HERE
print("The first 10 values are:", boolean[0:10])
print(f"There are {boolean.sum()} NaNs in array temperature")
print("The first 10 values are:", temperature_is_nan[0:10])
print(f"There are {temperature_is_nan.sum()} NaNs in array temperature")
```
%% Cell type:markdown id:c9f8994b tags:
Let's slice the array using the `boolean` array we just found to eliminate the NaNs. We can use the symbol `~`, which denotes the opposite: we want to keep those where np.isnan gives False as an answer.
Let's slice the array using the `temperature_is_nan` array we just found to eliminate the NaNs. We can use the symbol `~`, which denotes the opposite: we want to keep those where np.isnan gives False as an answer.
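As a minimal sketch of this idea on a made-up array (the names `values`, `values_is_nan` and `cleaned` are illustrative and not part of the assignment data):

```python
import numpy as np

# Toy data, not the greenhouse measurements
values = np.array([20.5, np.nan, 21.0, np.nan, 19.8])

# True where the entry is NaN, False elsewhere
values_is_nan = np.isnan(values)

# ~ flips every boolean, so indexing keeps only the non-NaN entries
cleaned = values[~values_is_nan]

print(values_is_nan)
print(cleaned)
```

Indexing with a boolean array returns a new, shorter array; `values` itself is unchanged until you reassign it.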
Check the size again, and make sure you recognize that we over-wrote the variable `temperature`. This will have an impact on other cells where you use this variable, for example, if you re-run the cell below Task 1.3, the result will be different, because the array contents have changed!
How big is the array now? How many values were removed?
</p>
</div>
%% Cell type:code id:1a111476 tags:
``` python
YOUR_CODE_HERE
```
%% Cell type:markdown id:94771b6e tags:
But now we have a problem: our `distance` array still has the entries that correspond to the bad entries in `temperature`. We can see that the dimensions of the arrays no longer match:
%% Cell type:code id:ade77e1b tags:
``` python
distance.size==temperature.size
distance.size==temperature.size
```
%% Cell type:markdown id:8a80b2d6 tags:
Also, we don't know what the index of the removed values were, since we over-wrote `temperature`! Luckily we have our `boolean` array, which records the indices with Nans, which we can also use to update our `distance` array.
Also, we don't know which indices were removed, since we overwrote `temperature`! Luckily we have our `temperature_is_nan` array, which records the positions of the NaNs and which we can also use to update our `distance` array.
It looks like there are some outliers in the dataset still! Let's investigate:
%% Cell type:code id:e05bdc36 tags:
``` python
print(temperature.min())
print(temperature.max())
```
%% Cell type:markdown id:9e8d396a tags:
The values are suspcious since they are +/-999...this is a common error code with some sensors, so we can assume that they can be removed from the dataset. We can easily remove these erroneous values of temperature, but this time we will use a different method than before. The explamation mark before an equals sig, `!=`, denotes "not equal to." We can use this as a logic operator to directly eliminate the values in one line. For example:
The values are suspicious since they are +/-999. This is a common error code with some sensors, so we can assume that they can be removed from the dataset. We can easily remove these erroneous values of temperature, but this time we will use a different method than before. The exclamation mark before an equals sign, `!=`, denotes "not equal to." We can use this as a logical operator to directly eliminate the values in one line. For example:
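A sketch of this pattern on two made-up paired arrays (`position` and `reading` are hypothetical names, chosen so the example does not touch the assignment variables):

```python
import numpy as np

# Toy paired data: each reading belongs to the position at the same index
position = np.array([0.0, 0.1, 0.2, 0.3])
reading = np.array([20.0, -999.0, 21.0, -999.0])

# Filter the companion array first, while `reading` still contains the -999 entries
position = position[reading != -999]

# Only now overwrite `reading` itself
reading = reading[reading != -999]

print(position)
print(reading)
```

If the two assignments were swapped, `reading != -999` would be all True by the time `position` is filtered, and the arrays would end up with different sizes.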
Use the "not equal to" operator to re-define temperature and distance such that all the temperatures with -999 are removed (don't do the +999 values yet!). Keep in mind that the order of the arrays matters: if you reassign temperature, you won't have the information any more to fix distance!!!
</p>
</div>
%% Cell type:code id:760ef4fc tags:
``` python
YOUR_CODE_HERE
YOUR_CODE_HERE
```
%% Cell type:markdown id:62f66295 tags:
Are the arrays the same size still? If you did it correctly, they should be.
%% Cell type:code id:c2565d1a tags:
``` python
print(distance.size==temperature.size)
print(distance.size==temperature.size)
temperature.size
```
%% Cell type:markdown id:7387ef6e tags:
For the +999 values we will use yet another method, a combination of the previous two.
Use the not equal to operator <b>and</b> a boolean array to define an array "mask" that will help you remove the data corresponding to temperatures with +999.
</p>
</div>
%% Cell type:markdown id:b5aa308b tags:
We can also do it with a boolean for data_y.
%% Cell type:code id:87c3df95 tags:
``` python
mask = YOUR_CODE_HERE
distance = distance[mask]
temperature = temperature[mask]
```
%% Cell type:markdown id:830e00fd tags:
The array is names "mask" because this process utilizes **masked arrays**...you can read more about it [here](https://python.plainenglish.io/numpy-masks-in-python-d8c13509fbc8).
The array is named "mask" because this process utilizes **masked arrays**...you can read more about it [here](https://python.plainenglish.io/numpy-masks-in-python-d8c13509fbc8).
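For comparison, here is a small sketch that puts a plain boolean mask next to NumPy's own masked-array type, using toy values (the array `reading` is made up). Note that the two use opposite conventions: in the boolean mask below, True means "keep", whereas in `np.ma` a True entry in `.mask` means "hidden".

```python
import numpy as np

# Toy data with a +999 error code
reading = np.array([20.0, 999.0, 21.0, 999.0])

# Plain boolean mask: True marks the entries we want to keep
mask = reading != 999
kept = reading[mask]

# Masked array: True in .mask marks the entries that are hidden
masked = np.ma.masked_equal(reading, 999)

print(kept)
print(masked.mask)
print(masked.compressed())  # the visible entries as an ordinary array
```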
%% Cell type:markdown id:aaf6c255 tags:
Anyway, now that we have removed the annoying +/-999 values, we can finally start to see our dataset more clearly:
Looks good! But wait---there also appear to be some values in the array that are not physically possible! We know for sure that there was nothing cold in the greenhouse during the measurements; also it's very likely that a "0" value could have come from an error in the sensor.
Looks good! But wait—there also appear to be some values in the array that are not physically possible! We know for sure that there was nothing cold in the greenhouse during the measurements; also it's very likely that a "0" value could have come from an error in the sensor.
See if you can apply the `numpy` method `nonzero` to remove zeros from the array. Hint: it works in a very similar way to `isnan`, which we used above.
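One difference from `isnan` is worth keeping in mind: `np.nonzero` returns the indices of the non-zero entries rather than a boolean array, although the result can still be used for indexing. A minimal sketch on a made-up array:

```python
import numpy as np

# Toy data containing zeros that we treat as sensor errors
reading = np.array([20.0, 0.0, 21.0, 0.0, 19.5])

# A tuple with one index array per dimension
nonzero_indices = np.nonzero(reading)

# Indexing with that tuple keeps only the non-zero entries
kept = reading[nonzero_indices]

print(nonzero_indices)
print(kept)
```

Because the result holds indices, the same `nonzero_indices` can also be applied to a paired array of the same length.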
It also seems quite obvious that the values above 50 degrees are also not physically possible (or perhaps Jonathan was standing near an oven?!). In any case, they aren't consistent with the rest of the data, so we should remove them.
Let's pretend there is a systematic error in our measurement device because it was not properly calibrated. It causes all observations below 15 degrees need to be corrected dividing the multiplying the measurement by 1.5. Numpy actually makes it very easy to change the contents of an array conditionally by replacement using the `where` method!
Let's pretend that there is a systematic error in our measurement device because it was not calibrated properly. As a result, all observations below 15 degrees need to be corrected by multiplying the measurement by 1.5. Numpy actually makes it very easy to replace the contents of an array based on a condition using the `where` method!
Remember you can investigate the `where` function in a notebook easily by executing `np.where?`. Try it and read the documentation!
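A minimal sketch of conditional replacement with `np.where(condition, value_if_true, value_if_false)`, using toy numbers rather than the greenhouse data (the names `reading` and `corrected` are illustrative):

```python
import numpy as np

# Toy data: two entries fall below the threshold of 15
reading = np.array([10.0, 20.0, 14.0, 30.0])

# Where the condition holds take reading * 1.5, elsewhere keep the original value
corrected = np.where(reading < 15, reading * 1.5, reading)

print(corrected)
```

Writing the result to a new name, as here, makes the cell safe to rerun; assigning back to the same variable applies the correction again on every run.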
Let's plot the array again to see what happened (you'll have to compare the two plots carefully to see the difference). Remember that every time you rerun the cell above, it overwrites `temperature` again, so you will probably need to restart the kernel a few times to reset the values.
Calculate the mean and variance of temperature. Use built-in numpy functions.
</p>
</div>
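As a sketch of the built-in functions on a toy array (not the assignment data): `np.var` divides by N by default, and passing `ddof=1` gives the sample variance that divides by N - 1. Which of the two the task expects is not stated here, so treat that choice as an assumption.

```python
import numpy as np

# Toy data
reading = np.array([18.0, 20.0, 22.0, 24.0])

mean = np.mean(reading)
variance_population = np.var(reading)       # divides by N (ddof=0)
variance_sample = np.var(reading, ddof=1)   # divides by N - 1

print(mean, variance_population, variance_sample)
```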
%% Cell type:code id:c833c8fd tags:
``` python
YOUR_CODE_HERE
```
%% Cell type:markdown id:ba022766 tags:
## Boosting productivity
%% Cell type:markdown id:2df59bce tags:
Did you ever got frustrated by sharing code with your friends? How nice would it be to do it the Google-Docs style and work on the same document simultaneously! Maybe you know Deepnote, which does this in an online interface. But there's a better solution: [Visual Studio Live Share](https://visualstudio.microsoft.com/services/live-share/)!
Have you ever gotten frustrated by sharing code with your friends? How nice would it be to do it the Google-Docs style and work on the same document simultaneously! Maybe you know Deepnote, which does this in an online interface. But there's a better solution: [Visual Studio Live Share](https://visualstudio.microsoft.com/services/live-share/)!
In this PA you'll install the required extension in VS Code. You'll use this extension on Wednesday in class, and can freely use it in your future career!
Download, install and login in the Visual Studio Live Share Extension from the Visual Studio Marketplace as explained in the <a href="https://mude.citg.tudelft.nl/2024/book/external/learn-programming/book/install/ide/vsc/vs_live_share.html">book</a>
Download, install and log in to the Visual Studio Live Share Extension from the Visual Studio Marketplace as explained in the <a href="https://mude.citg.tudelft.nl/2024/book/external/learn-programming/book/install/ide/vsc/vs_live_share.html">book</a>.
</p>
</div>
%% Cell type:markdown id:2751b89a tags:
After installing and signing into Visual Studio Live Share, you'll share a project with yourself to test the collaboration session
After installing and signing into Visual Studio Live Share, you'll share a project with yourself to test the collaboration session.
Start a collaboration session: select <strong>Live Share</strong> on the status bar or select <strong>Ctrl+Shift+P</strong> or <strong>Cmd+Shift+P</strong> and then select <strong>Live Share: Start collaboration session (Share)</strong>.
The first time you share, your desktop firewall software might prompt you to allow the Live Share agent to open a port. Opening a port is optional. It enables a secured direct mode to improve performance when the person you're working with is on the same network as you. For more information, see <a href="https://learn.microsoft.com/en-us/visualstudio/liveshare/reference/connectivity#changing-the-connection-mode">changing the connection mode</a>.
</p>
</div>
%% Cell type:markdown id:bffc6617 tags:
%% Cell type:markdown id:d3913dbd tags:
An invitation link will be automatically copied to your clipboard. You'll use this link to interact with yourself in this assignment. If you want to collaborate with other, you can share this link with other to open up the project in their browser on own VS Code.
An invitation link will be automatically copied to your clipboard. You'll use this link to interact with yourself in this assignment. If you want to collaborate with others, you can share this link with them to open up the project in their browser or own VS Code.
You'll also see the **Live Share** status bar item change to represent the session state.
Go back to your desktop participant of VS Code and try typing a few lines of correct code in the cell below. Do you see the same change happening live in the browser?
Go back to your browser version of VS code and try running the cell. Note that this requires Requesting access in the browser participant. Request that access and approve it in the desktop participant of VS code. In the desktop participant you now need to select your python environment. Does the cell run? Do you see the output in both participants? Make sure the output doesn't show an error!
Go back to your browser version of VS Code and try running the cell. Note that this requires the browser participant to request access. Request that access and approve it in the desktop participant of VS Code. In the desktop participant you now need to select your Python environment. Does the cell run? Do you see the output in both participants? Make sure the output doesn't show an error!
Let's try to make sense of what's happening. Where is the code executed? Run the following code cell from the browser participant. On which computer does this collaborative session run?
</p>
</div>
%% Cell type:code id:a9b9f24b tags:
%% Cell type:code id:3c8723e3 tags:
``` python
import platform
print("Running on a", platform.system(), "machine named:", platform.node())
```
%% Cell type:markdown id:55b401c9 tags:
%% Cell type:markdown id:b38e7f33 tags:
That's cool right? Imagine what you could do with this...
Explore some of the functionalities of Live Share (session chat, following). More information can be found <a href="https://learn.microsoft.com/en-us/visualstudio/liveshare/">here</a>.
Stop the collaboration session on the desktop-participant by opening the Live Share view on the <strong>Explorer</strong> tab or the <strong>VS Live Share tab</strong> and select the <strong>Stop collaboration session</strong> button:
Your web-participant will be notified that the session is over. It won't be able to access the content and any temp files will automatically be cleaned up. Don't forget to save your work on the desktop-participant!
Don't forget to save your work! You can do that on either the browser-participant or the desktop-participant!
</p>
<p>
After stopping the collaboration session, your web-participant will be notified that the session is over. It won't be able to access the content and any temp files will automatically be cleaned up.