Thursday, July 6, 2017

Thursday: No seriously, this week is all messed up for me now.

I spent part of the day thinking it was Monday, because we had a meeting, and those are usually on Monday.  But it's not, so tomorrow I have to try and sort out things that I want to finish this week.

Also, I learned that the answer to "I have all this crap in a data frame, and I want answers for each subset, but I don't want to have to do any work" is ddply.  Originally I did this with work data, but using the NSF PhD statistics works as well.

library(plyr)

# matched.dat is assumed to have broad_field, year, male, and female columns.
nsf = read.table("./matched.dat", sep = '\t', header = TRUE)   # read.table already returns a data frame
nsf$R = nsf$female / (nsf$male + nsf$female)                   # fraction of PhDs awarded to women
q = ddply(nsf, .(broad_field),
          function(x) c(m = mean(x$R), s = sd(x$R),
                        quantile(x$R, c(0.5, 0.75, 0.90, 0.98, 1.0))))
                           broad_field         m           s
1                            Education 0.6780287 0.012619293
2                          Engineering 0.2169928 0.014537864
3                  Humanities and arts 0.5035801 0.007278568
4                        Life sciences 0.5384433 0.017871779
5    Mathematics and computer sciences 0.2479207 0.009173072
6                               Otherb 0.5059506 0.011422802
7 Physical sciences and earth sciences 0.3117802 0.018656152
8       Psychology and social sciences 0.5832883 0.011162981

        50%       75%       90%       98%      100%
1 0.6816616 0.6865097 0.6927421 0.6930276 0.6930990
2 0.2220837 0.2279097 0.2304950 0.2320948 0.2324947
3 0.5058046 0.5084105 0.5096879 0.5120727 0.5126689
4 0.5453048 0.5518069 0.5544373 0.5564318 0.5569305
5 0.2466649 0.2529897 0.2614564 0.2629909 0.2633745
6 0.5105890 0.5143730 0.5165125 0.5207457 0.5218040
7 0.3140528 0.3249957 0.3339903 0.3353106 0.3356407
8 0.5849175 0.5890918 0.5946965 0.5973446 0.5980066

Ok, so the quantile data isn't super useful with the NSF data, but still: means and sigmas for each factor, and for the work data, pulling the quantiles was useful.  Then hubris took hold, and I wondered if I could use this to do linear fits for each factor as well.  The answer is no, not with ddply, because that reads and writes data frames, and lm outputs a model object.  So you have to use dlply to save the models in a list, and then use ldply to make a data frame from the parts of each model you care about:

f = dlply(nsf, .(broad_field), lm, formula = R ~ year)   # one lm fit per field, collected in a list
coeffs = ldply(f, coef)                                  # pull intercept and slope back into a data frame

coeffs$x2010 = (coeffs[,2] + 2010 * coeffs[,3]) * 100    # fitted value in 2010, as a percentage
coeffs$m = coeffs[,3] * 100                              # slope, in percent per year

                           broad_field (Intercept)          year    x2010           m
1                            Education   -5.318890  0.0029835418 67.80287  0.29835418 
2                          Engineering   -7.627153  0.0039025598 21.69928  0.39025598 
3                  Humanities and arts   -1.145504  0.0008204401 50.35801  0.08204401 
4                        Life sciences   -9.438112  0.0049634604 53.84433  0.49634604 
5    Mathematics and computer sciences    1.055520 -0.0004017907 24.79207 -0.04017907 
6                               Otherb   -3.713211  0.0020990852 50.59506  0.20990852 
7 Physical sciences and earth sciences  -10.098766  0.0051793762 31.17802  0.51793762 
8       Psychology and social sciences   -4.194659  0.0023770883 58.32883  0.23770883 

And yes, x2010 is the same as the mean above, and m (percent change per year) is approximately 1/5 of the sigma above.  Taking a quick average of just the math, engineering, and physical sciences data for 2017 gives something like 28%.  The lack of improvement in math is a problem, and engineering starts at a deficit.  The depressing thing is that even with the highest improvement rate for the physical sciences, it's still something like 40 years until parity.
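The parity estimate is just (50 - x2010) / m; a quick sketch with the fitted numbers from the table above:

```r
# Back-of-the-envelope years-from-2010 until R hits 50%, at the fitted rates.
# Values are the x2010 and m columns from the coefficient table above.
x2010 <- c(math = 24.79207, engineering = 21.69928, physical = 31.17802)
m     <- c(math = -0.04017907, engineering = 0.39025598, physical = 0.51793762)
years_to_parity <- (50 - x2010) / m
round(years_to_parity["physical"])
# physical sciences: (50 - 31.2) / 0.518, i.e. roughly 36 years (~2046);
# math's slope is negative, so it never gets there at the current rate.
```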

Also a plot:

library(ggplot2)
ggplot(nsf,aes(x=year,y=R,color=broad_field)) + geom_point() + geom_line() + geom_smooth(method=lm)
ggsave("/tmp/with_fits.png")

Yes, I'm sure there's a line thickness parameter I could have changed.  Whatever.
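(For the record, it's the size aesthetic; reusing the nsf frame from above, something like:

```r
# Same plot with thinner lines.  In the ggplot2 versions of this era, size
# controls line width; newer versions call it linewidth instead.
library(ggplot2)
ggplot(nsf, aes(x = year, y = R, color = broad_field)) +
  geom_point(size = 0.8) +
  geom_line(size = 0.3) +
  geom_smooth(method = lm, size = 0.5)
```

would do it.)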
