diff --git a/packages/bigframes/noxfile.py b/packages/bigframes/noxfile.py
index 5dba688d3c4a..09364c4e6ff9 100644
--- a/packages/bigframes/noxfile.py
+++ b/packages/bigframes/noxfile.py
@@ -605,11 +605,11 @@ def prerelease(session: nox.sessions.Session, tests_path, extra_pytest_options=(
         # Workaround https://github.com/googleapis/python-db-dtypes-pandas/issues/178
         "db-dtypes",
         # Ensure we catch breaking changes in the client libraries early.
-        "git+https://github.com/googleapis/python-bigquery.git#egg=google-cloud-bigquery",
+        "git+https://github.com/googleapis/google-cloud-python.git#egg=google-cloud-bigquery&subdirectory=packages/google-cloud-bigquery",
         "--upgrade",
         "-e",
         "git+https://github.com/googleapis/google-cloud-python.git#egg=google-cloud-bigquery-storage&subdirectory=packages/google-cloud-bigquery-storage",
-        "git+https://github.com/googleapis/python-bigquery-pandas.git#egg=pandas-gbq",
+        "git+https://github.com/googleapis/google-cloud-python.git#egg=pandas-gbq&subdirectory=packages/pandas-gbq",
     )
 
     # Print out prerelease package versions.
diff --git a/packages/bigframes/specs/bigframes-bigquery-contributing.md b/packages/bigframes/specs/bigframes-bigquery-contributing.md
new file mode 100644
index 000000000000..10931af0755f
--- /dev/null
+++ b/packages/bigframes/specs/bigframes-bigquery-contributing.md
@@ -0,0 +1,501 @@

# bigframes.bigquery inputs and outputs policies

The goal of the
[bigframes.bigquery APIs](https://dataframes.bigquery.dev/reference/api/bigframes.bigquery.html#module-bigframes.bigquery)
is to provide the simplest possible mapping from BigQuery (GoogleSQL)
[functions](https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/functions-all)
and
[operations](https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax)
to Python. "Simplest" is somewhat ambiguous, though, when it comes to the
types and behaviors involved, so this document aims to expand on that vision
with specific examples.
## SQL and BigFrames expression types
For each SQL expression type below, this section lists the corresponding
Python type(s), notes on expected behavior, and examples.

### Column expression (usable in a SELECT clause)

**Notes:** Both a Python Series and a column expression should be supported
as inputs, with the output reflecting the user's input. Use a `TypeVar`
rather than directly using union types to make type checking easier.

Special considerations for Series inputs:

If an input and output are both a Series with the same number of rows, make
sure the output Series is implicitly (row identity) alignable with the
original input. In other words, don't generate a table expression.

If there are multiple Series inputs, they should be implicitly aligned if
possible so as not to generate unnecessary table expressions.

**Examples:** Most scalar functions accept one or more column expressions as
input.
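The `TypeVar` guidance above can be sketched as follows. `Series`,
`Expression`, and `sql_lower` are hypothetical stand-ins for illustration,
not actual bigframes classes or APIs:

```python
from typing import TypeVar


class Expression:
    """Illustrative stand-in for a deferred SQL column expression."""

    def __init__(self, sql: str) -> None:
        self.sql = sql


class Series:
    """Illustrative stand-in for a bigframes Series."""

    def __init__(self, sql: str) -> None:
        self.sql = sql


# A constrained TypeVar lets type checkers infer "Series in, Series out" and
# "Expression in, Expression out", which a plain Union return type cannot.
ColumnLike = TypeVar("ColumnLike", Series, Expression)


def sql_lower(value: ColumnLike) -> ColumnLike:
    """Apply LOWER() to a column expression or Series (sketch only)."""
    # In real bigframes, a Series result would stay row-identity alignable
    # with the input rather than generating a new table expression.
    return type(value)(f"LOWER({value.sql})")
```

With this shape, `sql_lower(series)` is typed as a Series and
`sql_lower(expression)` as an Expression, without overloads.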
### Scalar values

**Notes:** Theoretically, we could try to get the type system to help the
user disambiguate between this case and the "Column expression" case, but I
think that's more trouble than it's worth with regards to the expectations of
Python users.
### Table expression

**Python type(s):** `bpd.DataFrame`

All columns are included as normal columns in the input table expression,
including named index columns. If column names aren't unique or contain
characters not compatible with BigQuery flexible column names, raise an
error.

Outputs are unordered and unindexed to allow for cleaner mapping with SQL.

**Notes:** Most APIs that take a table expression as input also output a
table expression with the same number of rows, passing through all unused
columns.

This should be used to pass through any index or ordering columns (as well as
all other columns, if that's the SQL behavior), to allow for easy joining
with the original input DataFrame.

**Examples:**

Same number of rows as the input, so we should preserve index and ordering.

Different number of rows in the output, so no need to preserve index or
ordering. The default index / ordering should be specified with the Session's
configuration.

Possible to have the same number of rows as the input, but joining with the
original goes against the purpose of the feature.
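The "raise an error" policy for DataFrame inputs could look roughly like this
sketch; `validate_column_names` is hypothetical, and the regex is a
deliberately simplified stand-in for BigQuery's actual flexible column name
rules:

```python
import re

# Simplified stand-in pattern; real BigQuery flexible column names allow a
# broader (and more precisely specified) set of characters.
_SIMPLIFIED_NAME = re.compile(r"^[A-Za-z0-9_ \-]+$")


def validate_column_names(names: list[str]) -> None:
    """Reject duplicate or (by this simplified rule) incompatible names."""
    seen: set[str] = set()
    for name in names:
        if name in seen:
            raise ValueError(f"Duplicate column name: {name!r}")
        seen.add(name)
        if not _SIMPLIFIED_NAME.match(name):
            raise ValueError(
                f"Column name not compatible with BigQuery: {name!r}"
            )
```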
### Table name

**Python type(s):** string referring to a fully-qualified table ID, e.g.
`project.dataset.table` or `project.catalog.namespace.table`.

**Notes:** Some SQL APIs do not support, or have limitations with, arbitrary
table expressions and instead take a table ID, such as the TABLESAMPLE
expression.

Also, SEARCH and VECTOR_SEARCH, if you want the indexes attached to the table
to actually apply.

For outputs, it might be preferable to output a table ID instead of a
DataFrame if the user is explicitly creating a table. For example, `to_gbq()`
returns a string with the table name, which is useful for the case where
BigFrames generates the table ID for the user.

**Examples:** All of the items from the "Table expression" section above.
APIs that require a table expression but don't take a table ID can trivially
take a table ID through a `(SELECT * FROM table)` subquery.

Some APIs only take a table ID and not an arbitrary table expression:
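The subquery trick mentioned above is mechanical; a hypothetical helper (not
a bigframes API) might look like:

```python
def as_table_expression(table_id: str) -> str:
    """Wrap a fully-qualified table ID so it can be used where an
    arbitrary table expression is expected (hypothetical helper).
    """
    return f"(SELECT * FROM `{table_id}`)"
```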
### Aggregated table expression

**Python type(s):** `DataFrameGroupBy`
### Analytic table expression
### Column name (unqualified\*)

\*I've only encountered examples where the table name / table expression is
passed in separately.

**Python type(s):** string; or, for cases where the column name is used as an
alias and we aren't using named Series, `dict[str, Expression]`.

**Notes:** Often a table expression input is paired with a column name input,
as is the case with the CREATE MODEL and VECTOR_SEARCH APIs.

If SQL expects a column name rather than a column expression, do not attempt
to change this in Python. For example, don't allow a Series as a substitute
for DataFrame + column name.

If the associated table expression is input as a DataFrame, validate that the
column names map cleanly to SQL and raise a ValueError if not. For example:
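As a sketch of the `dict[str, Expression]` alias case, with expressions
simplified down to plain SQL strings (`render_select` is a hypothetical
helper, not a bigframes API):

```python
def render_select(aliases: dict[str, str]) -> str:
    """Render {output column name: SQL expression} as a SELECT list,
    using the dict key as the column alias."""
    clauses = [f"{expr} AS `{name}`" for name, expr in aliases.items()]
    return "SELECT " + ", ".join(clauses)
```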
### Literal values

**Python type(s):** the corresponding literal Python value (e.g. `int`,
`float`, `str`).

**Notes:** For cases where scalar values are also supported, it should be
safe to start with this and then expand to support expressions without a
breaking change, as is done in
https://github.com/googleapis/google-cloud-python/pull/16606.

**Examples:** Most scalar functions accept one or more literal values as
input.
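The "literals first, expressions later" compatibility argument can be
illustrated with a hypothetical function (not an actual bigframes signature):

```python
from typing import Union


class Expression:
    """Illustrative stand-in for a deferred SQL expression."""

    def __init__(self, sql: str) -> None:
        self.sql = sql


def to_sql(value: Union[int, Expression]) -> str:
    # Version 1 of an API might accept only `int`. Widening the annotation
    # later to also accept Expression is backwards compatible: every
    # existing call site that passes a literal keeps working unchanged.
    return value.sql if isinstance(value, Expression) else str(value)
```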
### Scalar subqueries

**Python type(s):** Not supported yet, except implicitly in some aggregation
use cases.

Would need some sort of bigframes deferred expression that can be tied to a
table expression. (Possibly a DataFrame with one column?)
## Scalar ops

| Input type(s) | Output type | Notes |
| --- | --- | --- |
| Expression | Expression | |
| Series / DataFrame | Series / DataFrame | Preserve ordering and index(es). Join inputs as needed before applying the operation. |
| Mix of Expression and Series / DataFrame | Series / DataFrame | Preserve ordering and index(es). Join inputs as needed before applying the operation. |
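The input/output rules in this table can be sketched with hypothetical
stand-in types; real bigframes would also join Series / DataFrame inputs by
row identity before applying the op:

```python
class Expression:
    """Stand-in for a deferred SQL expression."""

    def __init__(self, sql: str) -> None:
        self.sql = sql


class Series:
    """Stand-in for an ordered, indexed Series."""

    def __init__(self, sql: str, index: list[int]) -> None:
        self.sql = sql
        self.index = index


def add(left, right):
    """Scalar op sketch following the table above."""
    sql = f"({left.sql} + {right.sql})"
    if isinstance(left, Expression) and isinstance(right, Expression):
        # Expression-only inputs keep the result a deferred Expression.
        return Expression(sql)
    # Any Series input promotes the result to a Series, preserving the
    # index (inputs would be joined/aligned first in real bigframes).
    index = left.index if isinstance(left, Series) else right.index
    return Series(sql, index)
```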