Support transpilation of STRTOK from Snowflake to DuckDB #7283

Open
fivetran-felixhuang wants to merge 3 commits into main from transpile_strtok_snowflake_duckdb

fivetran-felixhuang (Collaborator) commented Mar 13, 2026

https://docs.snowflake.com/en/sql-reference/functions/strtok

DuckDB doesn't have STRTOK, so we need to leverage REGEXP_SPLIT_TO_ARRAY to reproduce Snowflake's behavior.

There are some special Snowflake behaviors that we need to take care of:

  1. In Snowflake, each character in the delimiter string is a separate delimiter. For example, if the delimiter is '@.', then both '@' and '.' are treated as delimiters, and they can be matched in any order. We need to wrap the delimiter input in a regex group.
  2. Snowflake doesn't return empty strings, so we need to filter them out.
  3. If both the delimiter and the input string are empty, Snowflake returns NULL.
  4. If the delimiter is empty and the input string is non-empty, the whole string is treated as one token.
  5. Snowflake returns NULL for negative indices.
  6. Snowflake returns NULL if any argument is NULL.

For more details, please check the comments in strtok_sql.
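
The six behaviors above can be modeled in plain Python as a reference to compare transpiled results against. This is only a sketch and not the code in strtok_sql: the name `strtok_reference` is hypothetical, the single-space default delimiter follows the linked Snowflake docs, and treating index 0 the same as a negative index is an assumption that the list above does not state.

```python
import re
from typing import Optional


def strtok_reference(
    string: Optional[str],
    delimiters: Optional[str] = " ",
    part_index: Optional[int] = 1,
) -> Optional[str]:
    """Hypothetical model of the STRTOK behaviors listed in this PR description."""
    # Behavior 6: any NULL argument yields NULL.
    if string is None or delimiters is None or part_index is None:
        return None
    # Behavior 5: negative indices yield NULL (index 0 is an assumption here).
    if part_index < 1:
        return None
    if delimiters == "":
        # Behaviors 3 and 4: empty delimiter means zero or one token.
        tokens = [string] if string else []
    else:
        # Behavior 1: every delimiter character splits on its own, so build
        # a character class and escape each character individually.
        char_class = "[" + "".join(re.escape(ch) for ch in delimiters) + "]"
        # Behavior 2: drop the empty strings produced by adjacent delimiters.
        tokens = [tok for tok in re.split(char_class, string) if tok]
    # Indices past the last token yield NULL.
    return tokens[part_index - 1] if part_index <= len(tokens) else None
```

For example, `strtok_reference('user@snowflake.com', '.@', 2)` returns `'snowflake'`, and `strtok_reference('', '')` returns `None`.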

Testing

Source Snowflake query

SELECT
STRTOK('user@snowflake.com', '.@', 2) as a1,
STRTOK('user@snowflake.com', '.@', -1) as a2,
STRTOK('user@snowflake.com', '.@', 0) as a3,
STRTOK('user@snowflake.com', '@.', 8) as a4,
STRTOK('user snowflake com') as a5,
STRTOK('user@snowflake.com', '@.') as a6,
STRTOK('user.snowflake.com') as a7,
STRTOK('user.snowflake.com', '.', 3) as a71,
STRTOK('', '') as a8,
STRTOK('abc', '') as a9,
STRTOK('a', '', 1) as a10,
STRTOK('a', '', 2) as a101,
STRTOK('a', '', -1) as a102,
STRTOK('.b.', '.', 1) as a11,
STRTOK('.b.c.d.', '.', 2) as a12,
STRTOK('..', '.', 1) as a122,
STRTOK(NULL, '@.', 2) as a13,
STRTOK('a.b.c$g', NULL, 2) as a14,
STRTOK('a b c', ' ', 2) as a16,
STRTOK('user@snowflake.com', SUBSTR('.@^', 1, 2), 2) as a17,
STRTOK('ab', IFNULL(NULL, '')) as a18,
STRTOK('ab', IFNULL(NULL, ''), 1) as a19,
STRTOK('ab', IFNULL(NULL, ''), 2) as a20,
STRTOK('ab', IFNULL(NULL, ''), 0) as a21,
STRTOK('a.b.c', ifnull('.', '.'), 2) as b2,
STRTOK('a[b]c', ifnull('[]', '[]'), 2) as b3,
STRTOK('a^b^c', ifnull('^', '^'), 2) as b4,
STRTOK('a-b-c', ifnull('-', '-'), 2) as b5,
STRTOK('abc', ifnull('', ''), 2) as b6,
STRTOK('a+b+c', ifnull('+', '+'), 2) as b7,
STRTOK('a?b?c', ifnull('?', '?'), 2) as b8,
STRTOK('a(b)c', ifnull('()', '()'), 2) as b9,
STRTOK('a{b}c', ifnull('{}', '{}'), 2) as b10,
STRTOK('a|b|c', ifnull('|', '|'), 2) as b11,
STRTOK('a$b$c', ifnull('$', '$'), 2) as b12,
STRTOK('user@snowflake.com', SUBSTR('.@^', 1, 2), 2) as b13,
STRTOK('a./b/cg', ifnull('/.', '/.'), 3) as b14,

STRTOK('a$b/cg', '$/.', NULLIFZERO(1-1)) as c1,
STRTOK('a$b/cg', '$/.', 1) as c2,
STRTOK('a$b/c*g', '$/.', 2) as c3,

-- problematic
STRTOK('a.b\cg', ifnull('\.', '\.'), 3) as d1,
STRTOK('a.b\cg', '\.', NULLIFZERO(1-1)) as d2,
STRTOK('a.b\cg', '\.', 1) as d3,
STRTOK('a.b\cg', '\\', 1) as d4,
STRTOK('a.b\cg', '\.', 3) as d5,
STRTOK('a.b\cg', '\.', 4) as d6,
STRTOK('a.b\cg', '.', 2) as d7,
STRTOK('a.b\cg', '*', 2) as d8

The transpiled query works except for the problematic statements, which contain the escape character '\' in the input or delimiter.
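
As a general illustration of why a backslash needs care in any regex-based split (not a diagnosis of the failures above, whose cause this description does not state), the Python sketch below compares a character class built by plain concatenation with one built by escaping each delimiter character. The helper names `naive_class` and `escaped_class` are hypothetical.

```python
import re


def naive_class(delimiters: str) -> str:
    # Concatenation only: a backslash in the delimiters escapes what follows it.
    return "[" + delimiters + "]"


def escaped_class(delimiters: str) -> str:
    # Escape each character so that a backslash is matched literally.
    return "[" + "".join(re.escape(ch) for ch in delimiters) + "]"


text = "a.b\\cg"      # the five-plus-one characters: a . b \ c g
delimiters = "\\."    # two delimiter characters: backslash and dot

# The naive pattern is [\.], which matches only the dot.
naive_tokens = re.split(naive_class(delimiters), text)
# The escaped pattern is [\\\.], which matches the backslash and the dot.
escaped_tokens = re.split(escaped_class(delimiters), text)
```

Here `naive_tokens` is `['a', 'b\\cg']` (the backslash delimiter is silently lost), while `escaped_tokens` is `['a', 'b', 'cg']`.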

exp.Soundex,
exp.SoundexP123,
exp.SplitPart,
exp.Strtok,
geooo109 (Collaborator), Mar 13, 2026

Let's also include a test for this (in annotate_functions).

fivetran-felixhuang (Collaborator, Author)

There is already a test for Strtok (4955).

fivetran-felixhuang force-pushed the transpile_strtok_snowflake_duckdb branch from e4beaf1 to f3ed4ca on March 13, 2026 20:06
fivetran-felixhuang marked this pull request as ready for review on March 13, 2026 20:18
github-actions (Contributor)

SQLGlot Integration Test Results

Comparing:

  • this branch (sqlglot:transpile_strtok_snowflake_duckdb, sqlglot version: transpile_strtok_snowflake_duckdb)
  • baseline (main, sqlglot version: 29.0.2.dev175)

⚠️ Limited to dialects: bigquery, duckdb, snowflake

By Dialect

| dialect | main | sqlglot:transpile_strtok_snowflake_duckdb | transitions | links |
| --- | --- | --- | --- | --- |
| bigquery -> bigquery | 23844/23876 passed (99.9%) | 23844/23876 passed (99.9%) | No change | full result / delta |
| bigquery -> duckdb | 2078/2623 passed (79.2%) | 2078/2623 passed (79.2%) | No change | full result / delta |
| duckdb -> duckdb | 4003/4003 passed (100.0%) | 4003/4003 passed (100.0%) | No change | full result / delta |
| snowflake -> duckdb | 1645/2642 passed (62.3%) | 1645/2642 passed (62.3%) | No change | full result / delta |
| snowflake -> snowflake | 65878/65878 passed (100.0%) | 65865/65878 passed (100.0%) | 13 pass -> fail | full result / delta |

Overall

main: 99022 total, 97448 passed (pass rate: 98.4%), sqlglot version: 29.0.2.dev175

sqlglot:transpile_strtok_snowflake_duckdb: 99022 total, 97435 passed (pass rate: 98.4%), sqlglot version: transpile_strtok_snowflake_duckdb

Transitions:
13 pass -> fail
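
The figures above can be cross-checked with a few lines of Python. This is only an arithmetic sketch using the numbers reported in this comment; the `results` layout and the `pass_rate` helper are hypothetical. It also shows why the snowflake -> snowflake row still reads 100.0% despite 13 new failures: the rate rounds up at one decimal place.

```python
# (main passed, branch passed, total) per dialect pair, copied from the report.
results = {
    "bigquery -> bigquery": (23844, 23844, 23876),
    "bigquery -> duckdb": (2078, 2078, 2623),
    "duckdb -> duckdb": (4003, 4003, 4003),
    "snowflake -> duckdb": (1645, 1645, 2642),
    "snowflake -> snowflake": (65878, 65865, 65878),
}


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage rounded to one decimal place."""
    return round(100 * passed / total, 1)


main_passed = sum(main for main, _, _ in results.values())
branch_passed = sum(branch for _, branch, _ in results.values())
total = sum(count for _, _, count in results.values())
regressions = main_passed - branch_passed

# 65865/65878 is about 99.98%, which rounds to 100.0 at one decimal place.
snowflake_branch_rate = pass_rate(65865, 65878)
```

The sums reproduce the overall totals (99022 total, 97448 and 97435 passed), and `regressions` comes out to 13, matching the reported transitions.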
