Different behaviors for dplyr versus plyr summarize() when "reusing" a variable in the summarize() call

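As you said, dplyr reuses variables: once a name has been assigned inside a summarize() call, later expressions in the same call see the new value rather than the original column. As a result, your initial code tries to calculate a standard deviation from just one value. Look at the formula for the standard deviation: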

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$

With a single value, n = 1, so the denominator of the formula, n - 1, is 0, which causes the NaN result.

In your second dplyr code, the standard deviation is calculated from the original variable. Since every group for which an sd is calculated has n > 1, the denominator is greater than zero, so sd() returns a value.

dplyr just takes the last created instance of a variable. On the page @baptiste linked to, you can find a statement by Hadley Wickham from which you can conclude that it is better to use new names when creating new variables.
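
To make the difference concrete, here is a minimal sketch. The data frame `df`, its columns `g` and `x`, and the result names are hypothetical and are not taken from the question; the exact missing-value marker you see (NA or NaN) may depend on your dplyr version.

```r
library(dplyr)

# Hypothetical toy data: two groups with three values each.
df <- data.frame(
  g = rep(c("a", "b"), each = 3),
  x = c(1, 2, 3, 10, 20, 30)
)

# Reusing the name `x`: by the time sd(x) is evaluated, `x` already refers
# to the one-value-per-group mean, so sd() receives a single number.
reused <- df %>%
  group_by(g) %>%
  summarise(x = mean(x), x_sd = sd(x))

# Using a new name: `x` still refers to the original column, so sd() is
# computed from all three values in each group.
renamed <- df %>%
  group_by(g) %>%
  summarise(x_mean = mean(x), x_sd = sd(x))

print(reused)
print(renamed)
```

In the first call `x_sd` should come back missing for every group, while the second call should give a proper standard deviation per group. Giving each summary its own name keeps the result independent of the order of the expressions.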

I think this behavior should be stated explicitly in the documentation.




