Also, I learned that the answer to "I have all this crap in a data frame, and I want answers for each subset, but I don't want to have to do any work" is ddply. Originally I did this with work data, but using the NSF PhD statistics works as well.
library(plyr)
nsf = read.table("./matched.dat", sep='\t', header=TRUE)
nsf$R = nsf$female / (nsf$male + nsf$female)
q = ddply(nsf, .(broad_field), function(x) c(m = mean(x$R),s = sd(x$R), quantile(x$R, c(0.0, 0.5, 0.75, 0.90, 0.98, 1.0))))
                           broad_field         m           s       50%       75%       90%       98%      100%
1                            Education 0.6780287 0.012619293 0.6816616 0.6865097 0.6927421 0.6930276 0.6930990
2                          Engineering 0.2169928 0.014537864 0.2220837 0.2279097 0.2304950 0.2320948 0.2324947
3                  Humanities and arts 0.5035801 0.007278568 0.5058046 0.5084105 0.5096879 0.5120727 0.5126689
4                        Life sciences 0.5384433 0.017871779 0.5453048 0.5518069 0.5544373 0.5564318 0.5569305
5    Mathematics and computer sciences 0.2479207 0.009173072 0.2466649 0.2529897 0.2614564 0.2629909 0.2633745
6                               Otherb 0.5059506 0.011422802 0.5105890 0.5143730 0.5165125 0.5207457 0.5218040
7 Physical sciences and earth sciences 0.3117802 0.018656152 0.3140528 0.3249957 0.3339903 0.3353106 0.3356407
8       Psychology and social sciences 0.5832883 0.011162981 0.5849175 0.5890918 0.5946965 0.5973446 0.5980066
Ok, so the quantile data isn't super useful with the NSF data, but still. Means and sigmas for each factor, and for the work data, pulling the quantile stuff was useful. Then, hubris took hold, and I wondered if I could use this to do linear fits to each factor as well. The answer is no, not with ddply, because ddply reads and writes data frames, and lm outputs a model object. So you have to use dlply to save those models in a list, and then use ldply to make a data frame from the stuff you care about in the list:
f = dlply(nsf, .(broad_field), lm, formula = R ~ year)
coeffs = ldply(f,coef)
coeffs$x2010 = (coeffs[,2] + 2010 * coeffs[,3]) * 100
coeffs$m = coeffs[,3] * 100
broad_field (Intercept) year x2010 m
1 Education -5.318890 0.0029835418 67.80287 0.29835418
2 Engineering -7.627153 0.0039025598 21.69928 0.39025598
3 Humanities and arts -1.145504 0.0008204401 50.35801 0.08204401
4 Life sciences -9.438112 0.0049634604 53.84433 0.49634604
5 Mathematics and computer sciences 1.055520 -0.0004017907 24.79207 -0.04017907
6 Otherb -3.713211 0.0020990852 50.59506 0.20990852
7 Physical sciences and earth sciences -10.098766 0.0051793762 31.17802 0.51793762
8 Psychology and social sciences -4.194659 0.0023770883 58.32883 0.23770883
And yes, x2010 is the same as the mean above, and m (percent change per year) is approximately 1/5 of the sigma above. Taking a quick average of just the math, engineering, and physical sciences data for 2017 gives something like 28%. The lack of improvement in math is a problem, and engineering starts at a deficit. The depressing thing is that even with the highest improvement rate for the physical sciences, it's still something like 40 years until parity.
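For what it's worth, the back-of-the-envelope arithmetic behind that 40-year figure, plugging in the x2010 and m values from the coefficient table above (parity = 50%):

```r
# Rough years-to-parity (from 2010) using the fitted lines above:
# x2010 is the fitted percentage of women in 2010, m is percent change per year.
x2010 <- c(math = 24.79207, engineering = 21.69928, physical = 31.17802)
m     <- c(math = -0.04017907, engineering = 0.39025598, physical = 0.51793762)
years_to_parity <- (50 - x2010) / m
round(years_to_parity)
#        math engineering    physical
#        -627          73          36
```

On these fits, math never gets there (negative slope), engineering takes the better part of a century, and the physical sciences cross 50% around 2046.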
Also a plot:
library(ggplot2)
ggplot(nsf,aes(x=year,y=R,color=broad_field)) + geom_point() + geom_line() + geom_smooth(method=lm)
ggsave("/tmp/with_fits.png")
Yes, I'm sure there's a line thickness parameter I could have changed. Whatever.
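For the record, a sketch of that thickness tweak: line geoms take a linewidth argument in ggplot2 3.4+ (it was called size in older versions). Toy data stands in for the real nsf frame here so the snippet is self-contained:

```r
library(ggplot2)

# Toy stand-in for the nsf data frame (the real one comes from matched.dat)
nsf <- data.frame(year        = rep(2008:2010, 2),
                  R           = c(0.21, 0.22, 0.23, 0.50, 0.51, 0.52),
                  broad_field = rep(c("Engineering", "Education"), each = 3))

p <- ggplot(nsf, aes(x = year, y = R, color = broad_field)) +
  geom_point(size = 0.8) +
  geom_line(linewidth = 0.3) +               # 'size = 0.3' in ggplot2 < 3.4
  geom_smooth(method = lm, linewidth = 0.5)
```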
- More ggplot information.
- Hrm.
- That sandwich thing popped up again.
- Sits.
- I like this, because even though I have no clue what fandom this is from, I can pick up the entire story just from the pictures and the caption.
- Just some Wonder Woman.