Spark dayname function implementation #20825
kazantsev-maksim wants to merge 22 commits into apache:main from
Conversation
query T
SELECT dayname('2008-02-20'::DATE);
----
Should we also verify that the output data type matches the Spark data types (Utf8 / LargeUtf8 / Utf8View)?
fn spark_day_name(days: i32) -> String {
    let weekday = Date32Type::to_naive_date_opt(days).unwrap().weekday();
    let display_name = get_display_name(weekday.num_days_from_monday());
    display_name.unwrap()
Potential panic in the `unwrap` here?
I tried to improve it
}

fn spark_day_name(days: i32) -> String {
    let weekday = Date32Type::to_naive_date_opt(days).unwrap().weekday();
Here as well? Can we guarantee that this `unwrap` will never panic?
I tried to improve it
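One way to address both `unwrap` comments above is to make the whole lookup infallible by construction. A minimal std-only sketch (the helper name and the arithmetic are illustrative, not the PR's code; it relies on Date32 day 0 being 1970-01-01, which was a Thursday):

```rust
/// Panic-free weekday lookup for a Date32 value (days since 1970-01-01).
/// Returns None instead of panicking for any input.
fn spark_day_name_safe(days: i32) -> Option<&'static str> {
    const NAMES: [&str; 7] = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"];
    // 1970-01-01 was a Thursday, i.e. index 3 counting from Monday.
    // rem_euclid keeps the index in 0..7 even for negative (pre-epoch) days.
    let idx = (days as i64 + 3).rem_euclid(7) as usize;
    NAMES.get(idx).copied()
}
```

Since the computed index is always in 0..7, the `get` can never actually fail, so the function could equally return `&'static str` directly; keeping `Option` just makes the no-panic contract explicit at the call site.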
Test failures:
Thanks for the review @coderfender. Could you please take another look when you have time?
coderfender left a comment
I think the changes are looking good @kazantsev-maksim. Left a couple of questions for further review. Thank you very much for your patience :)
query T
SELECT dayname('2010-04-24'::TIMESTAMP);
----
Could we also add tests with actual timestamp values instead of dates parsed as timestamps here?
DataType::Date32 | DataType::Timestamp(_, _) => spark_day_name_inner(array),
DataType::Utf8 | DataType::Utf8View | DataType::LargeUtf8 => {
    let date_array =
        cast_with_options(array, &DataType::Date32, &CastOptions::default())?;
What would happen for an invalid string here? We could probably handle it and return an error for malformed input?
Invalid strings will be replaced with null values.
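For illustration, the "invalid input becomes NULL" semantics of a safe cast can be sketched without arrow (a std-only stand-in; the real code path is arrow's Utf8 -> Date32 cast kernel, and the parser below is deliberately simplified, with no leap-day validation):

```rust
/// Std-only stand-in for a "safe" string-to-Date32 cast: malformed input
/// becomes None (NULL) instead of an error.
fn safe_cast_to_date32(s: &str) -> Option<i32> {
    // Minimal "YYYY-MM-DD" parse; any failure short-circuits to None.
    let mut parts = s.split('-');
    let y: i64 = parts.next()?.parse().ok()?;
    let m: i64 = parts.next()?.parse().ok()?;
    let d: i64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() || !(1..=12).contains(&m) || !(1..=31).contains(&d) {
        return None;
    }
    // Days since 1970-01-01 via the standard civil-from-days algorithm.
    let yy = if m <= 2 { y - 1 } else { y };
    let era = (if yy >= 0 { yy } else { yy - 399 }) / 400;
    let yoe = yy - era * 400;
    let doy = (153 * (if m > 2 { m - 3 } else { m + 9 }) + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    Some((era * 146097 + doe - 719468) as i32)
}
```

With arrow's real cast, the equivalent knob is `CastOptions { safe: true, .. }`, which substitutes NULL for values that fail to parse rather than propagating an error.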
        6 => Some(String::from("Sun")),
        _ => None,
    }
}
I don't think we are handling timezone support here? Spark handles the session timezone in its current implementation, and ignoring it could produce wrong results.
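To illustrate the concern (the function name and the fixed-offset model are illustrative only; real timezones need a tz database): the same instant can fall on different calendar days, and hence different weekdays, once an offset is applied, so a Timestamp input must be shifted to the session timezone before being truncated to a date.

```rust
/// Weekday of an epoch timestamp as seen from a fixed UTC offset.
fn local_weekday(epoch_secs: i64, tz_offset_secs: i64) -> &'static str {
    const NAMES: [&str; 7] = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"];
    // Shift to local time, then truncate to whole days since 1970-01-01.
    let local_days = (epoch_secs + tz_offset_secs).div_euclid(86_400);
    // 1970-01-01 was a Thursday (index 3 counting from Monday).
    NAMES[(local_days + 3).rem_euclid(7) as usize]
}
```

For example, 2010-04-24 23:00 UTC is still Saturday in UTC, but already Sunday 01:00 at UTC+02:00.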
    TypeSignatureClass::Timestamp,
)]),
TypeSignature::Coercible(vec![Coercion::new_exact(
    TypeSignatureClass::Native(logical_date()),
Spark's `java.util.Date` maps to Date32 on the Rust side. Given that we are not handling Date64 in the match arm, do you still think we should have logical_date() in the signature, which accepts both Date32 and Date64? Perhaps we could restrict it to Date32?
query T
SELECT dayname('2010-04-24'::TIMESTAMP);
----
Sat
We might need tests to cover various timezones (at least one apart from UTC) :)
----
NULL

query T
I tested this in Spark 4.1.1 and found that the empty string behavior depends on ANSI mode:
-- ANSI on (Spark 4 default):
spark-sql> SELECT dayname('');
[CAST_INVALID_INPUT] The value '' of the type "STRING" cannot be cast to "DATE" ...
-- ANSI off:
spark-sql> SET spark.sql.ansi.enabled=false;
spark-sql> SELECT dayname('');
NULL
The current implementation returns NULL, which matches the non-ANSI behavior. Since Spark 4 defaults to ANSI mode, it might be worth supporting both behaviors here — maybe through an enable_ansi_mode flag similar to how mod/pmod handle it in the same crate. That way the caller (e.g. Comet or another Spark-compatible engine) can choose whether invalid string-to-date casts should error or return NULL, matching whichever ANSI mode the user has configured.
How does this work in terms of nullability? The nullability of dayname seems accurate to Spark (depends on input) but if invalid strings are returned as null, that could violate the contract here. Is it because in Spark that config applies at a previous cast layer instead of during the function execution as in here?
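The flag idea could be sketched like this (everything here is hypothetical: the `EvalMode` enum, the function name, and the hard-coded stand-in for the cast are illustrative, not code from this PR or from mod/pmod):

```rust
#[derive(Clone, Copy)]
enum EvalMode {
    Ansi,   // invalid cast input raises an error (Spark 4 default)
    Legacy, // invalid cast input becomes NULL
}

fn dayname_from_str(s: &str, mode: EvalMode) -> Result<Option<&'static str>, String> {
    const NAMES: [&str; 7] = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"];
    // Stand-in for the real Utf8 -> Date32 cast: only one literal "parses".
    let parsed: Option<i64> = if s == "2010-04-24" { Some(14723) } else { None };
    match (parsed, mode) {
        (Some(days), _) => Ok(Some(NAMES[(days + 3).rem_euclid(7) as usize])),
        (None, EvalMode::Ansi) => Err(format!(
            "[CAST_INVALID_INPUT] The value '{s}' of the type \"STRING\" cannot be cast to \"DATE\""
        )),
        (None, EvalMode::Legacy) => Ok(None), // NULL, matching the current behavior
    }
}
```

Under this shape the nullability question stays coherent: in `Legacy` mode the function is nullable for string inputs (the NULL is produced by the cast step, not by the weekday lookup), while in `Ansi` mode a malformed string never reaches the output as NULL because it errors first.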
Which issue does this PR close?
N/A
Rationale for this change
Add new spark function: https://spark.apache.org/docs/latest/api/sql/index.html#dayname
What changes are included in this PR?
Are these changes tested?
Yes, tests added as part of this PR.
Are there any user-facing changes?
No, this is a new function.