Removing missing values in R

Hello.... it's me....I was wondering if after all these years you'd like to meet -

[Adele is awesome :) ]

but I'll try that again...

Hello!

Just been revising some R skills on DataCamp (DATACAMP IS AMAAZING!), and thought I'd bring some clarity to the na.rm argument in functions!

(Please note: you will need to have installed R, and preferably RStudio, to try this out. Both are free.)


So when you actively set na.rm to TRUE, you're basically flipping a switch which tells R to ignore any values in a data set that are missing (R marks these as NA, short for "Not Available").

1 + 1 = 2 [HAPPY FACE :D]
But 1 + NA = [VERY SAD FACE :( ]
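If you want to see the sad face for yourself, here's a tiny sketch you can paste into the console. The variable names (happy, sad) are just made up for this example. The point is that any arithmetic touching an NA comes back as NA:

```r
# Ordinary arithmetic works as expected
happy <- 1 + 1
happy            # 2

# Anything combined with a missing value is itself missing
sad <- 1 + NA
sad              # NA

# Even multiplying by zero doesn't rescue it
NA * 0           # NA

# is.na() is the way to check for a missing value
is.na(sad)       # TRUE
```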

So, for example consider the following code:

# First, establish some data sets. 
# How about number of times you crave chocolate 
# and icecream per day, over a 7 day period
chocolate_cravings <- c(16, 9, 13, 5, NA, 17, 14)
ice_cream_cravings <- c(17, NA, 5, 16, 8, 13, 14)

# Now let's try and get the average of chocolate_cravings
mean(chocolate_cravings)
[1] NA

So R clearly can't handle finding the mean (average) of chocolate_cravings (CAUSE IT'S A HEALTH ENTHUSIAST!!!... just kidding), and that's because one of the values is missing (NA). R has no idea what that number should be, so any total that includes it comes out as NA too, and so does the mean.
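If you're curious where the missing value is hiding, here's a quick sketch using is.na(), which() and typeof(), all of which ship with base R. Note that the vector is still plain numeric, the NA just marks a gap in it:

```r
chocolate_cravings <- c(16, 9, 13, 5, NA, 17, 14)

# The vector is still numeric, NA and all
typeof(chocolate_cravings)         # "double"

# TRUE marks the missing day
is.na(chocolate_cravings)          # FALSE FALSE FALSE FALSE TRUE FALSE FALSE

# Which position is missing, and how many are missing in total?
which(is.na(chocolate_cravings))   # 5
sum(is.na(chocolate_cravings))     # 1

# One NA is enough to make the whole total NA
sum(chocolate_cravings)            # NA
```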

FAIL.

 

So instead of doing that - we can tell R to automatically skip these missing numbers by setting the na.rm argument to TRUE like so:

(NOTE: if you don't include na.rm in your function call, it will be set to FALSE by default)

 

mean(chocolate_cravings, na.rm = TRUE)
[1] 12.33333
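To get a feel for what na.rm = TRUE is doing here, this rough sketch drops the NA by hand and compares the two results. The names known_cravings, by_hand and with_na_rm are made up for illustration. Notice that the average is taken over the 6 known days, not all 7:

```r
chocolate_cravings <- c(16, 9, 13, 5, NA, 17, 14)

# Keep only the values that are NOT missing
known_cravings <- chocolate_cravings[!is.na(chocolate_cravings)]
known_cravings           # 16 9 13 5 17 14
length(known_cravings)   # 6

# Averaging the known values by hand...
by_hand <- mean(known_cravings)

# ...matches what na.rm = TRUE gives us
with_na_rm <- mean(chocolate_cravings, na.rm = TRUE)
all.equal(by_hand, with_na_rm)   # TRUE
```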

 

HORAAA! Great News! IT WORKED!....bad news.... that's a lot of cravings.....

So that's all pretty straightforward, but what happens when we add both our data-sets together and then total up the result?

Will R add all remaining numbers?

Will it leave some out?

FIND OUT ON THE NEXT EPISODE OF DRAGON BALL-

yes anyway, as I was about to say, R adds up each pair of elements when it adds data-sets together (see below) -

 

chocolate_cravings <- c(16, 9, 13, 5, NA, 17, 14)
ice_cream_cravings <- c(17, NA, 5, 16, 8, 13, 14)
chocolate_cravings + ice_cream_cravings
[1] 33 NA 18 21 NA 30 28

 

and because of this, it becomes a bit like a 4 year old eating vegetables - it gets extremely picky!

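So if it's adding TWO data-sets together, and it comes across one element in a data-set that is missing (NA), the result for that pair is NA, even if the corresponding element in the other data-set DOES have a number. Set na.rm to TRUE and R grumpily throws away the whole pair, so the number that WAS there doesn't get counted.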
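Here's a small sketch of which days end up on the naughty list. The name daily_total is made up for this example. Day 2 is missing an ice cream value and day 5 is missing a chocolate value, so both pairs turn into NA:

```r
chocolate_cravings <- c(16, 9, 13, 5, NA, 17, 14)
ice_cream_cravings <- c(17, NA, 5, 16, 8, 13, 14)

# Element-by-element addition: one NA in a pair makes the result NA
daily_total <- chocolate_cravings + ice_cream_cravings

# Which days were lost, and how many complete days are left?
which(is.na(daily_total))   # 2 5
sum(!is.na(daily_total))    # 5

# The average daily total over the 5 complete days
mean(daily_total, na.rm = TRUE)   # 26
```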

See the code below:

 

mean(chocolate_cravings, na.rm = TRUE)
[1] 12.33333

sum(chocolate_cravings + ice_cream_cravings, na.rm = TRUE)
[1] 130

sum(chocolate_cravings + ice_cream_cravings)
[1] NA

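# Just to demonstrate that only complete pairs of values are counted:
# adding up every non-missing value from both data-sets gives 147...
sum(16+9+13+5+17+14+17+5+16+8+13+14)
[1] 147

# ...but leaving out the 9 and the 8 (whose partners are NA) gives 130
sum(16+13+5+17+14+17+5+16+13+14)
[1] 130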
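If you actually want the 147 (every known value counted), one possible approach is to hand both data-sets to sum() as separate arguments rather than adding them together first. This is just a sketch of one option, and the names all_known and pairs_only are made up for the example:

```r
chocolate_cravings <- c(16, 9, 13, 5, NA, 17, 14)
ice_cream_cravings <- c(17, NA, 5, 16, 8, 13, 14)

# sum() accepts several vectors at once, so each NA is dropped
# on its own and its partner still gets counted
all_known <- sum(chocolate_cravings, ice_cream_cravings, na.rm = TRUE)
all_known    # 147

# Adding first turns whole pairs into NA before sum() ever sees them
pairs_only <- sum(chocolate_cravings + ice_cream_cravings, na.rm = TRUE)
pairs_only   # 130
```

Which one is "right" depends on the question you're asking: total cravings overall, or total cravings on days where you have both numbers.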

 

So as you can imagine, na.rm can be very useful in the right context, but it is no substitute for cleaning a dataset with many missing values, especially when you're working with multiple sets.
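As a very rough sketch of what a first cleaning pass might look like, you could put both data-sets in a data frame and use base R's complete.cases() and na.omit(). The names cravings and clean_cravings are made up here, and dropping rows is only one of several ways to deal with missing data:

```r
# Hypothetical data frame holding both data-sets side by side
cravings <- data.frame(
  chocolate = c(16, 9, 13, 5, NA, 17, 14),
  ice_cream = c(17, NA, 5, 16, 8, 13, 14)
)

# How many values are missing in each column?
colSums(is.na(cravings))    # chocolate 1, ice_cream 1

# Which days have a value in every column?
complete.cases(cravings)    # TRUE FALSE TRUE TRUE FALSE TRUE TRUE

# Drop any day with a missing value, then summarise what's left
clean_cravings <- na.omit(cravings)
nrow(clean_cravings)        # 5
colMeans(clean_cravings)    # chocolate 13, ice_cream 13
```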

...hmmm

why do i suddenly feel like chocolate.....

Danny Blaker

Danny Blaker, Melbourne

Danny has over 10 years of experience in the start-up and technology sectors, and is the founder and co-founder of numerous companies and initiatives, such as Unudge and Geartooth.

Danny's diverse skill set encompasses disruptive marketing strategy, business strategy, product design, audio production, data analysis (R, R Markdown, Python), graphic design (Photoshop, InDesign, Illustrator), communications, social media marketing, project management, growth strategy, UX design, front-end web development (WP, Joomla, JS, Python, HTML, CSS), and corporate law.

Danny is also a spreadsheet expert and an online instructor, teaching at Udemy.com. His courses have amassed over 11,000 students to date.

He also blogs regularly – you can find his posts at www.dannyblaker.com/blog.

You can reach Danny on Twitter @DannyBlaker