improve data cleaning method
Cleaning datasets based on the Z function is not a good idea. Cleaning with Z = +/- 3 is especially bad, since it assumes a normal distribution characterized by the mean and std dev, which will discard a huge number of points from heavily skewed datasets. We can improve this method in a number of ways. Some ideas to get started:
- at the very least, use a higher default Z score. Robert already made a commit raising the default from 3 to 5
- is there a better alternative to the Z function that can account for skewness? Obviously we can't assume a distribution here; that's the point of this whole package, so it needs to be very generalized. Perhaps we could compute the skewness, or some measure of each tail, and derive a separate Z score for each side?
- first show the user the number of points that would be eliminated at various Z values, then ask whether to update the Z threshold?
- regardless of approach, the number of eliminated points should be stored as an attribute on the class instance after cleaning
- if cleaning is, or can be, performed multiple times, maybe store this in a list recording sequential cleanings? For example, clean with Z = 10 and see that x points are removed, clean again with Z = 8 and see that y points are removed; total points removed = x + y
- the same tracking of eliminated points can be done for NaNs as well
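The per-side Z score idea above could be sketched roughly like this. This is just a sketch, not the package's actual API: `clean_asymmetric` is a hypothetical helper, and measuring spread separately on each side of the median is only one possible way to handle skew.

```python
import numpy as np

def clean_asymmetric(data, z_lower=3.0, z_upper=3.0):
    """Drop outliers using a separate spread estimate for each tail.

    Instead of one std dev around the mean, measure spread on each side
    of the median separately, so a heavy upper tail does not inflate the
    cutoff applied to the lower tail (and vice versa).
    """
    data = np.asarray(data, dtype=float)
    center = np.median(data)
    lower = data[data < center]
    upper = data[data >= center]
    # Per-side spread; fall back to the overall std if a side is empty.
    spread_lo = lower.std() if lower.size else data.std()
    spread_hi = upper.std() if upper.size else data.std()
    keep = (data >= center - z_lower * spread_lo) & \
           (data <= center + z_upper * spread_hi)
    # Return the cleaned data and the number of points removed.
    return data[keep], int((~keep).sum())
```

For a heavily right-skewed sample this keeps the tight lower tail intact while still cutting the long upper tail; a symmetric Z = 3 rule around the mean would instead use one inflated std dev for both sides.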
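The bookkeeping ideas (removed count as an attribute, a list of sequential cleanings, NaN tracking, and a preview of counts at candidate Z values) could fit together like this. All names here (`Cleaner`, `removal_history`, `preview`) are illustrative, not the package's existing interface:

```python
import numpy as np

class Cleaner:
    """Sketch of removal bookkeeping across sequential cleaning passes."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        # One entry per pass: (kind, threshold, n_removed).
        self.removal_history = []

    def drop_nans(self):
        mask = np.isnan(self.data)
        self.removal_history.append(("nan", None, int(mask.sum())))
        self.data = self.data[~mask]
        return self

    def clean(self, z=5.0):
        """Remove points whose |Z score| exceeds z, recording the count."""
        scores = np.abs((self.data - self.data.mean()) / self.data.std())
        mask = scores > z
        self.removal_history.append(("zscore", z, int(mask.sum())))
        self.data = self.data[~mask]
        return self

    def preview(self, z_values=(3, 5, 8, 10)):
        """Report how many points each candidate Z threshold would remove."""
        scores = np.abs((self.data - self.data.mean()) / self.data.std())
        return {zv: int((scores > zv).sum()) for zv in z_values}

    @property
    def n_removed(self):
        # Total removed across all passes, including NaN drops.
        return sum(n for _, _, n in self.removal_history)
```

With this shape, `preview` supports the "show counts first, then ask for a threshold" flow, and `removal_history` records each pass so that totals like x + y fall out of a simple sum.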