The physics of functions: saving space and time.

Andrea Osika
Jan 25, 2021
4 min read

When I am working on a project, there's a simple way to keep me on track: obtaining, scrubbing, exploring, and modeling the data. I was taught to use these steps, in order, as an outline, and to call it the OSEM (awesome, right?) method. It's a trick.

As I go through each of these steps, some tasks get repetitive. So I use another trick: I write functions to keep me from writing the same lines of code over and over. This saves space and time in a notebook, since we aren't repeating the same n lines of code again and again. Functions can even be stored outside the notebook.

When I'm scrubbing, or cleaning, the data, I'm making sure that the data I have is complete, in order, and makes sense. Sometimes we remove what are called outliers: data points so irregular that they throw off calculations. In a project where I was doing some hypothesis testing, it was necessary to remove outliers, and I ended up using the z-score to do it. I had several hypotheses I wanted to test on varying aspects of the data, and in each case I needed to find the outliers in order to remove them. I could quickly see this task becoming repetitive. I found myself writing the same block of code over and over again, using NumPy and scipy.stats.zscore to find each value's z-score, that is, its distance from the mean measured in standard deviations:

#find z-score and index of outliers
z = np.abs(stats.zscore(data))
idx_outliers = np.where(z>3,True,False)
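To make the z-score step concrete, here is a minimal sketch on made-up numbers (the sample values are an assumption for illustration, not data from the project). It computes the z-score by hand and checks that it matches what scipy returns, then flags anything more than 3 standard deviations from the mean.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 19 ordinary values plus one extreme value.
data = np.array([10.0] * 10 + [12.0] * 9 + [100.0])

# Manual z-score: distance from the mean, in units of the
# population standard deviation (NumPy's default, ddof=0).
z_manual = (data - data.mean()) / data.std()

# scipy's zscore also defaults to ddof=0, so the two should agree.
z_scipy = stats.zscore(data)

# Same rule as in the post: |z| > 3 marks an outlier.
is_outlier = np.abs(z_scipy) > 3
print(data[is_outlier])
```

One thing worth knowing: with very small samples no value can reach a z-score of 3 at all, so the cutoff only starts to be useful once there are a reasonable number of rows.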

That block was pretty simple, so at first the function was simple to build. You start building a function with the def keyword. The name of the function follows it, and the arguments the function takes go in the parentheses, just like when you call any other method or function. I like to use names that mean something to me in case I use them again. Close the line off with a ':', indent the body, and you're rolling.

def find_outliers_Z(data):
    z = np.abs(stats.zscore(data))
    idx_outliers = np.where(z>3,True,False)
    return idx_outliers

This worked great. Any time I wanted to find the outliers using the Z score, I could just call it:

find_outliers_Z(df['revenue'])
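The call returns a True/False flag per row, so the natural next step is to use it as a mask. Here is a small sketch of that, assuming a made-up 'revenue' column (the DataFrame contents are hypothetical):

```python
import numpy as np
import pandas as pd
from scipy import stats

def find_outliers_Z(data):
    z = np.abs(stats.zscore(data))
    idx_outliers = np.where(z > 3, True, False)
    return idx_outliers

# Hypothetical revenue column with one extreme value at the end.
df = pd.DataFrame({"revenue": [10.0] * 10 + [12.0] * 9 + [100.0]})

idx_outliers = find_outliers_Z(df["revenue"])

# ~ flips the flags, so we keep only the rows that are NOT outliers.
df_clean = df[~idx_outliers].copy()
print(len(df), "rows before,", len(df_clean), "rows after")
```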

It was handy, but I ran into problems when I wasn't working from a data frame:

#creating a dictionary from each employee then add their revenue:
reps = {}
for rep in dfr['EmployeeId'].unique():
    reps[str(rep)] = dfr.groupby('EmployeeId').get_group(rep)['SaleRev']
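As an aside, the same dictionary can probably be built in one pass by iterating over the groupby object directly, which avoids regrouping for every employee. This is just a sketch with a made-up `dfr`; the column names match the loop above, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical sales data using the same column names as above.
dfr = pd.DataFrame({
    "EmployeeId": [1, 1, 2, 2, 3],
    "SaleRev": [100.0, 150.0, 80.0, 95.0, 60.0],
})

# Iterating a groupby yields (key, sub-DataFrame) pairs,
# so each rep's revenue Series can be picked out directly.
reps = {str(rep): group["SaleRev"] for rep, group in dfr.groupby("EmployeeId")}

for rep, revenue in reps.items():
    print(rep, revenue.tolist())
```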

When I tried to use that dictionary with my function, I'd get an error, since the function was written with a Series in mind. To address this, and looking forward, I went back to my function and, with a lot of trial and error and (since I was newer) some collaboration from colleagues, got it to work. I added a few 'if' and 'elif' statements in case I ran into this again, plus some notes so that when I hit the same problem I could just reuse or modify the code. Documentation is a good practice to get into when building a function. In your documentation, be sure to describe each argument, including its requirements, and give an example of the output and of how to use the function:

def find_outliers_Z(data,col=None):
    """Use scipy to calculate absolute Z-scores 
    and return boolean array where True indicates it is an outlier

    Args:
        data (DataFrame,Series,or ndarray): data to test for outliers.
        col (str): If passing a DataFrame, must specify column to use.

    Returns:
        [boolean ndarray]: A True/False for each row, used to slice out outliers.
        
    EXAMPLE USE: 
    >> idx_outs = find_outliers_Z(df,col='AdjustedCompensation')
    >> good_data = df[~idx_outs].copy()
    """
    if isinstance(data, pd.DataFrame):
        if col is None:
            raise Exception('If passing a DataFrame, must provide col=')
        else:
            data = data[col]
    elif isinstance(data,np.ndarray):
        data= pd.Series(data)

    elif isinstance(data,pd.Series):
        pass
    else:
        raise Exception('data must be a DataFrame, Series, or np.ndarray')
    
    z = np.abs(stats.zscore(data))
    idx_outliers = np.where(z>3,True,False)
    return idx_outliers
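To see what the 'if' and 'elif' branches buy you, here is a quick sketch that feeds the function each supported type and then something unsupported. The function body is a compact copy of the one above (docstring trimmed), and the sample values and the 'SaleRev' column are made up for the demo.

```python
import numpy as np
import pandas as pd
from scipy import stats

def find_outliers_Z(data, col=None):
    """Compact copy of the function above, without the long docstring."""
    if isinstance(data, pd.DataFrame):
        if col is None:
            raise Exception('If passing a DataFrame, must provide col=')
        data = data[col]
    elif isinstance(data, np.ndarray):
        data = pd.Series(data)
    elif not isinstance(data, pd.Series):
        raise Exception('data must be a DataFrame, Series, or np.ndarray')
    z = np.abs(stats.zscore(data))
    return np.where(z > 3, True, False)

# Hypothetical values: one extreme point among ordinary ones.
values = [10.0] * 10 + [12.0] * 9 + [100.0]
df = pd.DataFrame({"SaleRev": values})

from_df = find_outliers_Z(df, col="SaleRev")      # DataFrame + column name
from_series = find_outliers_Z(df["SaleRev"])      # Series
from_array = find_outliers_Z(np.array(values))    # ndarray

# A dictionary is not one of the supported types, so this should raise.
try:
    find_outliers_Z({"1": df["SaleRev"]})
    dict_rejected = False
except Exception as err:
    dict_rejected = True
    print("Rejected:", err)
```

All three supported inputs should give the same flags, and the dictionary gets a readable message instead of a confusing error from deep inside scipy.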

I found uses for a few other functions in my notebook. Even though they saved space down the line, since I wasn't writing an additional 3-7 lines of code over and over again, leaving the definitions in the notebook I was working from still made it messy. Even defining that one little function ultimately took up 35 lines of code. It wasn't tidy to the eye. I ended up creating them in a text editor (I like Sublime) and saving them in a .py file in my repo. Then I could just import the file and assign it an alias, like I do for anything else.

import functions as fn

But since the function was now external to the notebook, the file had to import all the libraries pertinent to that function, or I'd get an error on that, too. It ended up looking like this:

def find_outliers_Z(data,col=None):
    """Use scipy to calculate absolute Z-scores 
    and return boolean array where True indicates it is an outlier

    Args:
        data (DataFrame,Series,or ndarray): data to test for outliers.
        col (str): If passing a DataFrame, must specify column to use.

    Returns:
        [boolean ndarray]: A True/False for each row, used to slice out outliers.
        
    EXAMPLE USE: 
    >> idx_outs = find_outliers_Z(df,col='AdjustedCompensation')
    >> good_data = df[~idx_outs].copy()
    """
    #import libraries:
    from scipy import stats
    import numpy as np
    import pandas as pd

    if isinstance(data, pd.DataFrame):
        if col is None:
            raise Exception('If passing a DataFrame, must provide col=')
        else:
            data = data[col]
    elif isinstance(data,np.ndarray):
        data= pd.Series(data)

    elif isinstance(data,pd.Series):
        pass
    else:
        raise Exception('data must be a DataFrame, Series, or np.ndarray')
    
    z = np.abs(stats.zscore(data))
    idx_outliers = np.where(z>3,True,False)
    return idx_outliers
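If you want to try the save-and-import workflow end to end, here is a minimal sketch. It writes a tiny module to a temporary folder and imports it with an alias. The module name `demo_functions` and the temporary folder are assumptions made for the demo (in a real repo the .py file would simply sit next to the notebook). It also puts the imports at the top of the file rather than inside the function, which is a common alternative to the version above; either way the file has to bring in its own libraries.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Source for a small, hypothetical module. The imports live at the top
# of the file, so every function in it can use np and stats.
module_source = textwrap.dedent('''
    import numpy as np
    from scipy import stats

    def find_outliers_Z(data):
        z = np.abs(stats.zscore(data))
        return np.where(z > 3, True, False)
''')

# Save it as demo_functions.py in a temporary folder (hypothetical name).
module_dir = tempfile.mkdtemp()
Path(module_dir, "demo_functions.py").write_text(module_source)

# Make the folder importable, then import the file with an alias.
sys.path.insert(0, module_dir)
importlib.invalidate_caches()
import demo_functions as fn

flags = fn.find_outliers_Z([10.0] * 10 + [12.0] * 9 + [100.0])
print(flags.sum(), "outlier(s) found")
```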

So, back to the dictionary. It ended up working in that case, and was useful in removing the outliers:

#iterate through dictionary to find the outliers
for rep, rep_data in reps.items():
    idx_outs = fn.find_outliers_Z(rep_data)
    print(f'Found {idx_outs.sum()} outliers in Employee # {rep}')
    #replace the rep's Series with a copy that has the outliers removed
    reps[rep] = rep_data[~idx_outs]
print('\n All of these outliers were removed')

Then I can use any of the functions saved in that file. One thing I learned, though, is that others who look at your notebook, or even you yourself months later, might want to see the source code for functions stored in a file like mine. If that's important to you, here's what I found to help:

import functions as fn
## Uncomment the line below to see the source code for the imported functions
# fs.ihelp(Cohen_d,False),fs.ihelp(find_outliers_IQR,False), fs.ihelp(find_outliers_Z,False)

Note that fs in the commented line is a separate helper, not the fn alias, so it needs its own import before that line will run. Showing the source this way is another type of documentation that keeps everyone, including yourself, in the loop on what's happening. Maybe even the for loop you write in your function. Sorry, that was horribly corny. But I couldn't help it.
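If you'd rather not depend on an extra helper, Python's standard library can show source code too. This is a sketch of that alternative, not the approach used in the project: inspect.getsource returns the source text of any function that was defined in a .py file. The example uses statistics.mean from the standard library so it runs anywhere; with the module from this post, the equivalent call would presumably be inspect.getsource(fn.find_outliers_Z).

```python
import inspect
import statistics

# getsource reads the .py file the function came from and
# returns the function's definition as a string.
source_text = inspect.getsource(statistics.mean)
print(source_text)
```

This works because the function lives in a file on disk, which is one more small reason to keep helpers in a .py file.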

If you're curious about the project I was working on, you can visit it here.

'Til next time, happy problem-solving!
