Commit 3935888

make aggregation of empty GroupedDataFrame correct with AsTable (#3222)
1 parent 330f9d3 commit 3935888

6 files changed: +121 -25 lines changed

NEWS.md

Lines changed: 3 additions & 0 deletions
@@ -4,6 +4,9 @@
 
 * Fix incorrect handling of column metadata in `insertcols!` and `insertcols`
   ([#3220](https://github.com/JuliaData/DataFrames.jl/pull/3220))
+* Correctly handle `GroupedDataFrame` with no groups in multi-column
+  operation specification syntax
+  ([#3122](https://github.com/JuliaData/DataFrames.jl/issues/3122))
 
 ## Display improvements
 

docs/src/man/split_apply_combine.md

Lines changed: 13 additions & 4 deletions
@@ -30,15 +30,24 @@ object from your data frame using the `groupby` function that takes two argument
 (1) a data frame to be grouped, and (2) a set of columns to group by.
 
 Operations can then be applied on each group using one of the following functions:
-* `combine`: does not put restrictions on number of rows returned, the order of rows
-  is specified by the order of groups in `GroupedDataFrame`; it is typically used
-  to compute summary statistics by group;
+* `combine`: does not put restrictions on number of rows returned per group;
+  the returned values are vertically concatenaded following order of groups in
+  `GroupedDataFrame`; it is typically used to compute summary statistics by group;
+  for `GroupedDataFrame` if grouping columns are kept they are put as first columns
+  in the result;
 * `select`: return a data frame with the number and order of rows exactly the same
   as the source data frame, including only new calculated columns;
   `select!` is an in-place version of `select`;
 * `transform`: return a data frame with the number and order of rows exactly the same
   as the source data frame, including all columns from the source and new calculated columns;
-  `transform!` is an in-place version of `transform`.
+  `transform!` is an in-place version of `transform`;
+  existing columns in the source data frame are put as first columns in the result;
+
+As a special case, if a `GroupedDataFrame` that has zero groups is passed then
+the result of the operation is determined by performing a single call to the
+transformation function with a 0-row argument passed to it. The output of this
+operation is only used to identify the number and type of produced columns, but
+the result has zero rows.
 
 All these functions take a specification of one or more functions to apply to
 each subset of the `DataFrame`. This specification can be of the following forms:
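
The zero-group special case documented in this hunk can be illustrated with a short sketch. It assumes a DataFrames.jl version that includes this commit, and it reuses inputs from the test set added in `test/grouping.jl`; the variable names (`gdf_empty`, `gdf_sliced`, `f`) are illustrative only.

```julia
using DataFrames

# Case 1: the source data frame has no rows, so there are no groups.
gdf_empty = groupby(DataFrame(a = Int[]), :a)

# Case 2: the source has rows, but an empty range selects zero groups.
gdf_sliced = groupby(DataFrame(a = [1, 2]), :a)[2:1]

# The function returns a vector holding one NamedTuple; with `AsTable`
# its fields become the output columns `x` and `y`.
f = x -> [(x = 1, y = "a")]

# With zero groups the function output only determines column names and
# element types; the value it produced is not kept in the result.
res_empty = combine(gdf_empty, :a => f => AsTable, :a => :b)
res_sliced = combine(gdf_sliced, :a => f => AsTable, :a => :b)
```

According to the tests in this commit, both calls should return a data frame with zero rows and columns `a::Int`, `x::Int`, `y::String`, `b::Int`.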

src/abstractdataframe/selection.jl

Lines changed: 12 additions & 5 deletions
@@ -37,19 +37,26 @@ const TRANSFORMATION_COMMON_RULES =
 (1) a data frame to be grouped, and (2) a set of columns to group by.
 
 Operations can then be applied on each group using one of the following functions:
-* `combine`: does not put restrictions on number of rows returned, the order of rows
-  is specified by the order of groups in `GroupedDataFrame`; it is typically used
-  to compute summary statistics by group; for `GroupedDataFrame` if grouping columns
-  are kept they are put as first columns in the result;
+* `combine`: does not put restrictions on number of rows returned per group;
+  the returned values are vertically concatenaded following order of groups in
+  `GroupedDataFrame`; it is typically used to compute summary statistics by group;
+  for `GroupedDataFrame` if grouping columns are kept they are put as first columns
+  in the result;
 * `select`: return a data frame with the number and order of rows exactly the same
   as the source data frame, including only new calculated columns;
   `select!` is an in-place version of `select`; for `GroupedDataFrame` if grouping columns
   are kept they are put as first columns in the result;
 * `transform`: return a data frame with the number and order of rows exactly the same
   as the source data frame, including all columns from the source and new calculated columns;
-  `transform!` is an in-place version of `transform`; for `GroupedDataFrame`
+  `transform!` is an in-place version of `transform`;
   existing columns in the source data frame are put as first columns in the result;
 
+As a special case, if a `GroupedDataFrame` that has zero groups is passed then
+the result of the operation is determined by performing a single call to the
+transformation function with a 0-row argument passed to it. The output of this
+operation is only used to identify the number and type of produced columns, but
+the result has zero rows.
+
 All these functions take a specification of one or more functions to apply to
 each subset of the `DataFrame`. This specification can be of the following forms:
 1. standard column selectors (integers, `Symbol`s, strings, vectors of integers,

src/groupeddataframe/complextransforms.jl

Lines changed: 0 additions & 1 deletion
@@ -28,7 +28,6 @@ function _combine_with_first((first,)::Ref{Any},
     @assert first isa Union{NamedTuple, DataFrameRow, AbstractDataFrame}
     @assert f isa Base.Callable
     @assert incols isa Union{Nothing, AbstractVector, Tuple, NamedTuple}
-    @assert first isa Union{NamedTuple, DataFrameRow, AbstractDataFrame}
     extrude = false
 
     lgd = length(gd)

src/groupeddataframe/splitapplycombine.jl

Lines changed: 27 additions & 15 deletions
@@ -486,6 +486,23 @@ function _combine_process_pair_symbol(optional_i::Bool,
     end
 end
 
+@noinline function expand_res_astable(res, kp1, emptyres::Bool)
+    prepend = all(x -> x isa Integer, kp1)
+    if !(prepend || all(x -> x isa Symbol, kp1) || all(x -> x isa AbstractString, kp1))
+        throw(ArgumentError("keys of the returned elements must be " *
+                            "`Symbol`s, strings or integers"))
+    end
+    if any(x -> !isequal(keys(x), kp1), res)
+        throw(ArgumentError("keys of the returned elements must be equal"))
+    end
+    outcols = [[x[n] for x in res] for n in kp1]
+    # make sure we only infer column names and types for empty res, but do not
+    # produce values that were generated when computing firstres
+    emptyres && foreach(empty!, outcols)
+    nms = [prepend ? Symbol("x", n) : Symbol(n) for n in kp1]
+    return outcols, nms
+end
+
 # perform a transformation specified using the Pair notation with multiple output columns
 function _combine_process_pair_astable(optional_i::Bool,
                                        gd::GroupedDataFrame,
@@ -506,19 +523,15 @@ function _combine_process_pair_astable(optional_i::Bool,
             firstmulticol, NOTHING_IDX_AGG, threads)
         @assert length(outcol_vec) == 1
         res = outcol_vec[1]
-        @assert length(res) > 0
-
-        kp1 = keys(res[1])
-        prepend = all(x -> x isa Integer, kp1)
-        if !(prepend || all(x -> x isa Symbol, kp1) || all(x -> x isa AbstractString, kp1))
-            throw(ArgumentError("keys of the returned elements must be " *
-                                "`Symbol`s, strings or integers"))
-        end
-        if any(x -> !isequal(keys(x), kp1), res)
-            throw(ArgumentError("keys of the returned elements must be identical"))
+        if isempty(res)
+            emptyres = true
+            res = firstres
+        else
+            emptyres = false
         end
-        outcols = [[x[n] for x in res] for n in kp1]
-        nms = [prepend ? Symbol("x", n) : Symbol(n) for n in kp1]
+        kp1 = isempty(res) ? () : keys(res[1])
+
+        outcols, nms = expand_res_astable(res, kp1, emptyres)
     else
         if !firstmulticol
             firstres = Tables.columntable(firstres)
@@ -527,9 +540,8 @@ function _combine_process_pair_astable(optional_i::Bool,
         end
         idx, outcols, nms = _combine_multicol(Ref{Any}(firstres), Ref{Any}(fun), gd,
                                               wincols, threads)
-
         if !(firstres isa Union{AbstractVecOrMat, AbstractDataFrame,
-                                NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}})
+                                NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}})
             lock(gd.lazy_lock) do
                 # if idx_agg was not computed yet it is nothing
                 # in this case if we are not passed a vector compute it.
@@ -541,8 +553,8 @@ function _combine_process_pair_astable(optional_i::Bool,
                 idx = idx_agg[]
             end
         end
-        @assert length(outcols) == length(nms)
     end
+    @assert length(outcols) == length(nms)
     if out_col_name isa AbstractVector{Symbol}
         if length(out_col_name) != length(nms)
             throw(ArgumentError("Number of returned columns is $(length(nms)) " *
test/grouping.jl

Lines changed: 66 additions & 0 deletions
@@ -4312,4 +4312,70 @@ end
     @test_throws ArgumentError gdf[Not([true true true true])]
 end
 
+@testset "aggregation of empty GroupedDataFrame with table output" begin
+    df = DataFrame(:a => Int[])
+    gdf = groupby(df, :a)
+    @test isequal_typed(combine(gdf, :a => (x -> [(x=1, y="a")]) => AsTable, :a => :b),
+                        DataFrame(a=Int[], x=Int[], y=String[], b=Int[]))
+    @test isequal_typed(combine(gdf, :a => (x -> [(1, "a")]) => AsTable, :a => :b),
+                        DataFrame(a=Int[], x1=Int[], x2=String[], b=Int[]))
+    @test isequal_typed(combine(gdf, :a => (x -> ["ab"]) => AsTable, :a => :b),
+                        DataFrame(a=Int[], x1=Char[], x2=Char[], b=Int[]))
+    # test below errors because keys for strings do not support == comparison
+    @test_throws ArgumentError combine(gdf, :a => (x -> ["ab", "cd"]) => AsTable, :a => :b)
+    @test isequal_typed(combine(gdf, :a => (x -> []) => AsTable, :a => :b),
+                        DataFrame(a=Int[], b=Int[]))
+    @test_throws ArgumentError combine(gdf, :a => (x -> [(a=x, b=x), (a=x, c=x)]) => AsTable)
+    @test isequal_typed(combine(gdf, :a => (x -> [(x=1, y=2), (x=3, y="a")]) => AsTable),
+                        DataFrame(a=Int[], x=Int[], y=Any[]))
+    @test isequal_typed(combine(gdf, :a => (x -> [(x=[1], y=2), (x=[3], y="a")]) => AsTable),
+                        DataFrame(a=Int[], x=Vector{Int}[], y=Any[]))
+    @test isequal_typed(combine(gdf, :a => (x -> [(x=[1], y=2), (x=[3], y="a")]) => [:z1, :z2]),
+                        DataFrame(a=Int[], z1=Vector{Int}[], z2=Any[]))
+    @test_throws ArgumentError combine(gdf, :a => (x -> [(x=[1], y=2), (x=[3], y="a")]) => [:z1, :z2, :z3])
+
+    df = DataFrame(:a => [1, 2])
+    gdf = groupby(df, :a)[2:1]
+    @test isequal_typed(combine(gdf, :a => (x -> [(x=1, y="a")]) => AsTable, :a => :b),
+                        DataFrame(a=Int[], x=Int[], y=String[], b=Int[]))
+    @test isequal_typed(combine(gdf, :a => (x -> [(1, "a")]) => AsTable, :a => :b),
+                        DataFrame(a=Int[], x1=Int[], x2=String[], b=Int[]))
+    @test isequal_typed(combine(gdf, :a => (x -> ["ab"]) => AsTable, :a => :b),
+                        DataFrame(a=Int[], x1=Char[], x2=Char[], b=Int[]))
+    # test below errors because keys for strings do not support == comparison
+    @test_throws ArgumentError combine(gdf, :a => (x -> ["ab", "cd"]) => AsTable, :a => :b)
+    @test isequal_typed(combine(gdf, :a => (x -> []) => AsTable, :a => :b),
+                        DataFrame(a=Int[], b=Int[]))
+    @test_throws ArgumentError combine(gdf, :a => (x -> [(a=x, b=x), (a=x, c=x)]) => AsTable)
+    @test isequal_typed(combine(gdf, :a => (x -> [(x=1, y=2), (x=3, y="a")]) => AsTable),
+                        DataFrame(a=Int[], x=Int[], y=Any[]))
+    @test isequal_typed(combine(gdf, :a => (x -> [(x=[1], y=2), (x=[3], y="a")]) => AsTable),
+                        DataFrame(a=Int[], x=Vector{Int}[], y=Any[]))
+    @test isequal_typed(combine(gdf, :a => (x -> [(x=[1], y=2), (x=[3], y="a")]) => [:z1, :z2]),
+                        DataFrame(a=Int[], z1=Vector{Int}[], z2=Any[]))
+    @test_throws ArgumentError combine(gdf, :a => (x -> [(x=[1], y=2), (x=[3], y="a")]) => [:z1, :z2, :z3])
+
+    df = DataFrame(:a => [1, 2])
+    gdf = groupby(df, :a)
+    @test isequal_typed(combine(gdf, :a => (x -> [(x=1, y="a")]) => AsTable, :a => :b),
+                        DataFrame(a=1:2, x=[1, 1], y=["a", "a"], b=1:2))
+    @test isequal_typed(combine(gdf, :a => (x -> [(1, "a")]) => AsTable, :a => :b),
+                        DataFrame(a=1:2, x1=[1, 1], x2=["a", "a"], b=1:2))
+    @test isequal_typed(combine(gdf, :a => (x -> ["ab"]) => AsTable, :a => :b),
+                        DataFrame(a=1:2, x1=['a', 'a'], x2=['b', 'b'], b=1:2))
+    # test below errors because keys for strings do not support == comparison
+    @test_throws ArgumentError combine(gdf, :a => (x -> ["ab", "cd"]) => AsTable, :a => :b)
+    @test isequal_typed(combine(gdf, :a => (x -> []) => AsTable, :a => :b),
+                        DataFrame(a=1:2, b=1:2))
+    @test_throws ArgumentError combine(gdf, :a => (x -> [(a=x, b=x), (a=x, c=x)]) => AsTable)
+    @test isequal_typed(combine(gdf, :a => (x -> [(x=1, y=2), (x=3, y="a")]) => AsTable),
+                        DataFrame(a=[1, 1, 2, 2], x=[1, 3, 1, 3], y=Any[2, "a", 2, "a"]))
+    @test isequal_typed(combine(gdf, :a => (x -> [(x=[1], y=2), (x=[3], y="a")]) => AsTable),
+                        DataFrame(a=[1, 1, 2, 2], x=[[1], [3], [1], [3]], y=Any[2, "a", 2, "a"]))
+    @test isequal_typed(combine(gdf, :a => (x -> [(x=[1], y=2), (x=[3], y="a")]) => [:z1, :z2]),
+                        DataFrame(a=[1, 1, 2, 2], z1=[[1], [3], [1], [3]], z2=Any[2, "a", 2, "a"]))
+    @test_throws ArgumentError combine(gdf, :a => (x -> [(x=[1], y=2), (x=[3], y="a")]) => [:z1, :z2, :z3])
+    @test_throws ArgumentError combine(gdf, :a => (x -> [Dict('x' => 1)]) => AsTable)
+end
+
 end # module
