Spark quarter function implementation #20808
kazantsev-maksim wants to merge 25 commits into apache:main
Conversation
Thanks for putting this together — the new quarter UDF is headed in a good direction, but I think there are a couple of compatibility and robustness issues we should address before merging.
The main blocker is that the current registration uses an exact Date32 signature, which looks narrower than Spark’s documented behavior for quarter. I also left a couple of smaller follow-ups around reusing the existing date_part("quarter", ...) path and avoiding the unwrap() panic path.
I’d also love to see coverage added for Spark’s documented string-literal example so we lock that compatibility in.
```rust
impl SparkQuarter {
    pub fn new() -> Self {
        Self {
            signature: Signature::exact(vec![DataType::Date32], Volatility::Immutable),
```
I think this is the main thing we should fix before merging. Right now the UDF is registered with an exact Date32 signature, which means we no longer preserve Spark’s documented call shape for quarter.
Spark’s SQL docs show SELECT quarter('2016-08-31'); returning 3, and this SLT file used to carry that example before it was replaced with explicit ::DATE casts. With the current signature, we only validate the casted form and could end up rejecting the plain string-literal case that Spark accepts.
Could we switch this to a coercible signature, or possibly just route through the existing date_part('quarter', ...) behavior, and add coverage for the uncasted query?
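For illustration, a broadened registration could look roughly like this, sketched against datafusion_expr's `Signature` API (a non-authoritative sketch; exact variant names can differ across DataFusion versions):

```rust
use arrow::datatypes::DataType;
use datafusion_expr::{Signature, TypeSignature, Volatility};

// Sketch only: accept both string literals and Date32 at planning time,
// rather than pinning the UDF to an exact Date32 signature.
let signature = Signature::one_of(
    vec![
        TypeSignature::Exact(vec![DataType::Utf8]),
        TypeSignature::Exact(vec![DataType::Date32]),
    ],
    Volatility::Immutable,
);
```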
```rust
fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    let [arg] = take_function_args("quarter", args.args)?;
```
Small suggestion here: this seems to repeat the same scalar/array Date32 dispatching pattern we already have in other datetime helpers, while datafusion_functions::datetime::date_part() already supports "quarter".
Would it make sense to delegate to that existing implementation instead? It feels like that would help keep coercion rules, null handling, and any future date-part behavior aligned in one place.
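As a rough sketch of that delegation (assuming a Date32 `array: ArrayRef` is already in hand; arrow's `date_part` kernel is what this PR's later imports pull in):

```rust
use arrow::compute::{DatePart, date_part};

// Sketch only: let the shared kernel compute the quarter instead of
// hand-rolling scalar/array Date32 dispatch; error handling elided.
let quarters = date_part(array.as_ref(), DatePart::Quarter)?;
```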
```rust
fn spark_quarter(days: i32) -> Result<i32> {
    let quarter = Date32Type::to_naive_date_opt(days).unwrap().quarter();
```
One thing that made me a little nervous here is the unwrap(). That introduces a panic path if a malformed Date32 value ever makes it down to this helper.
last_day handles the same kind of conversion by returning an explicit error instead, which feels a bit safer for a public UDF. Could we do the same here so bad inputs show up as query errors rather than aborting execution?
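As a minimal, self-contained sketch of that error-returning shape (plain Rust, no chrono; `spark_quarter_checked` and the year bounds are hypothetical stand-ins, not the PR's actual helper):

```rust
// Convert days since 1970-01-01 to (year, month, day) using Howard
// Hinnant's civil_from_days algorithm.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468;
    let era = (if z >= 0 { z } else { z - 146_096 }) / 146_097;
    let doe = z - era * 146_097; // day of era, in [0, 146096]
    let yoe = (doe - doe / 1460 + doe / 36_524 - doe / 146_096) / 365;
    let y = yoe + era * 400;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March = 0
    let mp = (5 * doy + 2) / 153; // month index, March = 0
    let d = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let m = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    (if m <= 2 { y + 1 } else { y }, m, d)
}

// Hypothetical panic-free variant: return an error for out-of-range days
// (mirroring chrono's fallible to_naive_date_opt) instead of unwrap().
fn spark_quarter_checked(days: i32) -> Result<i32, String> {
    let (year, month, _) = civil_from_days(days as i64);
    // Illustrative bound; chrono's NaiveDate supports roughly this range.
    if !(-262_144..=262_142).contains(&year) {
        return Err(format!("invalid Date32 value: {days}"));
    }
    Ok(((month - 1) / 3 + 1) as i32)
}

fn main() {
    // 17044 days after the epoch is 2016-08-31, Spark's documented example.
    println!("{:?}", spark_quarter_checked(17044)); // prints Ok(3)
}
```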
I tried to rework it.
Thanks for the review @kosiew. Could you please take another look when you have time?
kosiew left a comment
@kazantsev-maksim
Thanks for the update.
The unwrap panic concern in quarter.rs looks resolved, and accepting string inputs is a good step forward.
I still see two follow-ups that seem important before this is ready.
```rust
TypeSignature::Exact(vec![DataType::Utf8View]),
TypeSignature::Exact(vec![DataType::LargeUtf8]),
TypeSignature::Exact(vec![DataType::Date32]),
TypeSignature::Exact(vec![DataType::Timestamp(
```
I think there is still one important gap here.
quarter is still declared with an exact Timestamp(Millisecond, None) signature, while Spark's date_part wrapper already uses the broader coercible timestamp path.
Because of that, timestamp inputs with other units or timezones can still get rejected during planning, even though the implementation below handles DataType::Timestamp(_, _) once execution starts.
Could we align this with the existing Spark datetime coercion model so quarter behaves consistently with the rest of that path?
```
1

query I
SELECT quarter('2009-01-12'::string);
```
Nice to see the string coverage added here.
I think we still need the specific regression case from Spark's documented uncasted form.
Right now this file checks quarter('2009-01-12'::string), but it does not restore a plain string literal query like quarter('2016-08-31').
Since preserving that call shape was the reason for broadening the signature, could we add that case back as well?
kosiew left a comment
@kazantsev-maksim
Thanks for the patch.
I noticed a few issues that look worth fixing before this lands.
```rust
Ok(Arc::new(Field::new(
    self.name(),
    DataType::Int32,
    args.arg_fields[0].is_nullable(),
```
I think the return-field nullability needs to be loosened here.
Right now return_field_from_args mirrors the input field nullability, but the new string path can produce NULL even when the input is non-null. This patch adds cases like quarter('abc'::string) and quarter(''::string) returning NULL, so quarter(non_null_utf8_col) would still be advertised as Int32 NOT NULL even though execution can yield nulls.
That looks like a schema contract bug. It also differs from existing Spark helpers like next_day, which force nullable output when invalid strings can map to NULL.
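A minimal sketch of the loosened contract (assuming the surrounding `return_field_from_args` body stays as in the diff above):

```rust
// Sketch only: always advertise a nullable Int32, because the string path
// can produce NULL (e.g. quarter('abc')) even for non-null inputs.
Ok(Arc::new(Field::new(self.name(), DataType::Int32, true)))
```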
Thanks, fixed.
```rust
}
DataType::Utf8 | DataType::Utf8View | DataType::LargeUtf8 => {
    let date_array =
        cast_with_options(array, &DataType::Date32, &CastOptions::default())?;
```
I am a bit concerned that the new string handling is narrower than the shared datetime coercion path.
This currently forces every string through a Date32 cast before calling date_part. That can reject valid timestamp-shaped strings that date_part already accepts elsewhere, for example date_part('second', '2020-09-08T12:00:12.12345678+00:00') in datafusion/sqllogictest/test_files/datetime/date_part.slt.
Because this does not route through the existing date_part('quarter', ...) behavior, quarter can still diverge from the rest of the datetime coercion model for string inputs. Could we reuse the same coercion path here so the behavior stays aligned?
@kosiew datafusion's date_part does not support string types, you'll still have to cast to a date type first.
```rust
use arrow::array::{Array, ArrayRef};
use arrow::compute::{CastOptions, DatePart, cast_with_options, date_part};
use arrow::datatypes::{DataType, Field, FieldRef};
use datafusion::logical_expr::{
```
Nice addition overall. One thing to fix here is the direct import from datafusion::logical_expr.
In datafusion/spark/Cargo.toml, datafusion is still an optional dependency behind the core feature, so this regresses a CI-critical configuration. cargo check -p datafusion-spark --no-default-features now fails with use of unresolved module or unlinked crate 'datafusion'.
The rest of datafusion-spark pulls these expression types from datafusion_expr, which keeps the no-default-features build working. Could we switch this import to match the existing pattern?
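Concretely, the swap could look like this (a sketch; the exact set of imported names should match what the rest of datafusion-spark already uses):

```rust
// Import expression types from datafusion_expr instead of datafusion::logical_expr,
// so the --no-default-features build of datafusion-spark keeps compiling.
use datafusion_expr::{ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility};
```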
Thanks, fixed
This reverts commit 88a8503.
Which issue does this PR close?
N/A
Rationale for this change
Add a new Spark function: https://spark.apache.org/docs/latest/api/sql/index.html#quarter
What changes are included in this PR?
Are these changes tested?
Yes, tests added as part of this PR.
Are there any user-facing changes?
No, this is a new function.