见:

plyr函数前两字表示输入及输出形式

a = array, d= data frame, l = list, _ = 不输出

[adl]*ply 是对列表的每一个元素执行指定操作

m*ply 多种输入,接收 矩阵 / list-array / 数据框,按行切分,执行fun,每行内容做为fun的参数传入

r*ply 重复执行某项操作,适合画随机数分布

参数

.margins 数据切片的方式

以二维变量为例,.margins = 1 为按行切,=2 为按列切, = c(1,2) 为按每个cell切分

切片key可以组合,例如 .( product = a * b, round_a = round(a) )

.fun

指定作用于每个切片的函数

.progress

输出进度条,例如.progress=”text”输出文本模式进度条

splat

作用于整个dataframe

ddply(mtcars, .(round(wt)), function(df) mean_hp_cyl(df$hp, df$cyl))
ddply(mtcars, .(round(wt)), splat(mean_hp_cyl))

each:函数列表

each(min,max) 相当于 function (x) c(min = min(x), max = max(x))

例子

排序 arrange(myCars, cyl, desc(disp))

列转换(mutate功能与transform相同,但mutate中新增的列col_b可以引用刚刚新增的列col_a)

baseball <- ddply(baseball, .(id), transform, cyear = year - min(year) + 1)
base2 <- ddply(baseball, .(id), mutate,career_year = year - min(year) + 1)
mutate(df, cyear = year - min(year),cpercent = cyear / (max(year) - min(year)))

子集

subset(somedata, somecol > 0.999)$id

ddply(coefs_df, .(lat, long), subset, value == min(value))

聚合统计

ddply(coefs_df, .(lat, long), summarise,
             ozone_min = min(value), ozone_max = max(value))
ddply(mtcars, .(logcyl = log(cyl)), each(nrow, ncol))

对各year数据中的各列元素求平均

ddply(baseball, .(year), colwise(median))
ddply(baseball, .(year), colwise(nmissing, c("sb", "cs", "so")))

按指定条件分块统计

count(baseball[1:100,], c("id", "year"))

函数f执行失败时,以NULL替代,不错误退出

safef <- failwith(NULL, f)

对传入mutate时,还不存在于df的列做转换

df <- data.frame(a = rep(c("a","b"), each = 10), b = 1:20)
label <- "xxx"
ddply(df, "a", here(mutate), label = paste(label, b))

合并两个data.frame

join(x, y, by = NULL, type = "left", match = "all")

其中,

  • by 指定合并的条件
  • type 指定合并的方式 left/right/inner/full
  • match 指定取出的数据集合 all/first,与sql类似

合并多个data.frame

dfs <- list(
a = data.frame(x = 1:10, a = runif(10)),
b = data.frame(x = 1:10, b = runif(10)),
c = data.frame(x = 1:10, c = runif(10))
)
join_all(dfs)
join_all(dfs, "x")

迭代版本的llply

liply(baseball_id, summarise, mean_rbi = mean(rbi, na.rm = TRUE))

替换值

mapvalues(z, from = c(1, 5, 9), to = c(10, 50, 90))

match_df 根据指定的longterm,筛出baseball中符合条件的原始数据

longterm <- subset(count(baseball, "id"), freq > 25)
bb_longterm <- match_df(baseball, longterm, on="id")
bb_longterm[1:5,]

更新列名

x <- rename(x, replace=c("d" = "c"))

更新取值

y <- factor(c("a", "b", "c", "a"))
revalue(y, c(a = "A", c = "C"))

数据精度调整

round_any(135, 25, floor)
round_any(135, 10, ceiling)

根据指定维度取出子集

x <- array(seq_len(3 * 4 * 5), c(3, 4, 5))
take(x, 3, 1)
take(x, 2, 1)
take(x, 3, 1, drop = TRUE) 


blog comments powered by Disqus

Published

16 November 2012

Tags