Back to cleaning...

Andrea Osika
Dec 18, 2020
4 min read

Updated: Jan 25, 2021

I was catching up with a fellow data colleague and shared that I was looking for some practice with data cleaning. I've taken some time to go a little deeper into SQL and visualization with Tableau (oooh-next week's post! I forgot to show you what I learned!) and realized my Python game had rusted a little. I needed some data cleaning to practice around on since modeling demands useable data. The colleague I mention is very much into poli-sci and found a data set that was in .txt format and contained nested data. It was a mess.

When I imported it into pandas, even more so:

The suppliers and recipients are all in the same column and the subsequent data after each recipient is indented one more. I began investigating ways to move columns around. I looked at .pivot(), .unstack(), and other ways.

I thought about how I might tackle identifying who the suppliers were since they came in the row above the first recipient (indicated by the 'R:') and how I could collect those that came after... I noticed some other rows that were indented by another level as well. I thought about making a dictionary as well. But really, all wanted was to shift the columns over...one here one there... When I was talking to my colleague, he mentioned that he was thinking of the .shift() function he never got to use that much. I remembered .shift() from a time series project I recently worked on and was intrigued.

I first isolated the suppliers by finding the first row of recipients (R:) :

#all the suppliers come right before the recipients which are indicated by having an R:
#isolating those first:
df['r'] = df.iloc[:,0].str.find('R:')
df.head()

and creating an index for the recipients,

#idendifying index list of all the recipients
index = df.index
recps = index[df.iloc[:,0].str.find('R:') == 0]
recps

which made making a list of suppliers quite easy:

#identifying index for all the suppliers
sups = recps -1
sups

I started cleaning a little more, but making columns that made sense (sups_recips to me makes sense - it's suppliers and recipients, and I knew I'd delete it):

#renaming Unnamed 0: since it throws key errors
df.rename(columns = {'Unnamed: 0':'sups_recips'}, inplace = True)

#creating a column to shift the values into:
df.insert(0, 'Supplier', value=' ')

Then I made some new columns separate from the original dataframe those using .shift().

shift() takes several arguments. Here are the defaults:

periods=1,

freq=None,

axis=0,

fill_value=None

It's more frequently used in time-series. In this case, we are using the functionality of shifting the number of periods for the columns. Since my value is -1, I'm manipulating the values by one value prior, and since I'm operating on the 1 axis, I'm shifting my values by one column to the left. Slick, right?

#shifting the values over by pulling the indexes and adjacent values
col= df.loc[sups, ["Supplier", "sups_recips"]].shift(-1, axis=1, fill_value=" ")

This gave me the shape to fill in the values in the column I created :

#filling it in
df['Supplier'] = col['Supplier']

Next: I had to pull out all the recipients and put them in their own column. I isolated them but finding all the blanks in the newly formed Supplier column:

recip1idx = df[df['Supplier'].isnull()].index.tolist()

and filled in all the suppliers for each recipient:

df['Supplier'].ffill(inplace=True)

Then, repeated the process, but shifted the columns the other way, just because I could:

1) make a column

df.insert(2, 'Recipient', value=' ')

2) shift it

#shifting the values over by pulling the indexes and adjacent values
col1= df.loc[recip1idx, ["sups_recips", "Recipient"]].shift(1, axis=1, fill_value=" ")

3) replace the values:

df['Recipient'] = col1['Recipient']

I got rid of the column that had originally contained both the suppliers and the recipients:

df.drop('sups_recips', axis=1, inplace=True)

and renamed columns. It was really starting to bug me and took another look:

I can see where in the case of row#13 needs to be shifted over. Luckily, I know how and get to practice some more.

To this, I need to identify which indices need to be moved:

I found the key value for that line... It was tricky since it wasn't a null value. I tried one space with no luck but isolated it with a little slicing and copied it and created a list of indices (It looked like four spaces, but honestly, it didn't work until I copied and pasted the value):

shft_recip = df['Recipient'] == '     '
sups_idx = [i for i, true in enumerate(shft_recip) if true]

then... shifting happened by reassigning the indices with the values of the next column over shifted over by one, then filling in the column space wouldn't be needing anymore with blanks:

#reassigning the indexes with the shifted values:
df.loc[sups_idx] = df.loc[sups_idx].shift(-1, axis=1, fill_value=" ")

Now we're talking:

sooo.. now we just need to get rid of those pesky blanks. Again, isolate the indices of the ones we want. In this case, I want all those who have a value in Supplier and not in Recipient:

dfsr = df[['Supplier', 'Recipient']]
dels = dfsr[dfsr['Recipient'].isnull()].index.tolist()

and get rid of 'em:

df.drop(dels, axis=0, inplace=True)

...and we don't need that 'R:' anymore:

df['Recipient'] = df['Recipient'].str.replace('R:', "")

...and get rid of some other stuff we don't need in a couple of columns that impairs the useability of numeric values and dates:

df['No'] = df['No'].str.replace(')', "")
df['No'] = df['No'].str.replace('(',"")

...how beautiful! Clean, organized, functional data:

Now we can more efficiently .pivot() and .unstack() this data, visualize and use the data!

A special thanks to my colleague https://github.com/MichaelBurak/ who inspired and gimped me through the process.

To see the full notebook: https://github.com/andiosika/shifting

Back to cleaning...

Recent Posts

Comments