
New Parquet-Variant encoding#7130

Open
AdamGS wants to merge 2 commits into develop from adamg/parquet-variant

Conversation


@AdamGS AdamGS commented Mar 23, 2026

Summary

This PR introduces a new encoding to support Arrow's canonical Variant extension type, but does not integrate it with anything else.

It does include basic pieces like slice/take/filter, and it can (or at least should) roundtrip with the equivalent arrow array.

Testing

The encoding includes a bunch of basic tests, both for making sure it roundtrips with arrow and for the various nullability cases.

@AdamGS AdamGS added the changelog/feature A new feature label Mar 23, 2026
@AdamGS AdamGS force-pushed the adamg/parquet-variant branch 2 times, most recently from 62280a5 to f266c90 on March 23, 2026 at 16:21

codspeed-hq bot commented Mar 23, 2026

Merging this PR will not alter performance

✅ 1106 untouched benchmarks
⏩ 1522 skipped benchmarks [1]


Comparing adamg/parquet-variant (04c60a8) with develop (b524763)


Footnotes

  1. 1522 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.


a10y commented Mar 23, 2026

It looks like this supports one typed_value per column. How would you represent something like this example, which shreds out multiple nested columns from an object?

[image: example of shredding multiple nested columns out of an object]

Physically I think this differs from shredding out a struct column with two fields.


AdamGS commented Mar 24, 2026

I think in that case it is a struct, because they are named


AdamGS commented Mar 24, 2026

typed_value will be a struct (I think that's an "object" in variant terms), but not directly of the event_type and event_ts types; its fields will themselves be ParquetVariant arrays, roughly:

Struct {
    "event_ts": Variant {
        ..,
        typed_value: Timestamp
    },
    "event_type": Variant {
        ..,
        typed_value: String
    },
}
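To make that layout concrete, here is a std-only Rust sketch of the shredded shape; `Variant` and `ShreddedEvent` are illustrative placeholders, not the real vortex or arrow types. The point is that each shredded object field is itself a full variant column with its own value/typed_value pair, not a bare typed column.

```rust
// Std-only sketch of the shredded layout described above.
// These types are placeholders for illustration only.
#[allow(dead_code)]
struct Variant<Typed> {
    metadata: Vec<u8>,          // variant metadata bytes
    value: Option<Vec<u8>>,     // unshredded fallback value, if any
    typed_value: Option<Typed>, // shredded representation, if any
}

#[allow(dead_code)]
struct ShreddedEvent {
    event_ts: Variant<i64>,      // typed_value shredded as a timestamp
    event_type: Variant<String>, // typed_value shredded as a string
}
```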

@AdamGS AdamGS force-pushed the adamg/parquet-variant branch 3 times, most recently from 3d7ecb4 to ad93dcf on March 24, 2026 at 13:28
@AdamGS AdamGS requested review from a10y, connortsui20 and gatesn and removed request for a10y, connortsui20 and gatesn March 24, 2026 15:33
AdamGS added a commit that referenced this pull request Mar 24, 2026
## Summary

Fills in the missing parts of making Variant a canonical array. It is still
not fully supported, but this is a step towards it.

These changes started as part of
#7130, but I figured they were
just noise in that already-too-large PR.

## API Changes

Makes variant officially a canonical array and type.

---------

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS force-pushed the adamg/parquet-variant branch 2 times, most recently from 6e32d0f to fbf4d4e on March 25, 2026 at 12:36
@AdamGS AdamGS marked this pull request as ready for review March 25, 2026 14:18
@AdamGS AdamGS force-pushed the adamg/parquet-variant branch from fbf4d4e to 4848c82 on March 25, 2026 at 14:18
@AdamGS AdamGS requested review from a10y, gatesn and robert3005 March 25, 2026 14:19
@AdamGS AdamGS force-pushed the adamg/parquet-variant branch 2 times, most recently from e5a799e to 338af8a on March 25, 2026 at 16:07
@AdamGS AdamGS requested a review from connortsui20 March 26, 2026 14:52

@connortsui20 connortsui20 left a comment


Looks good! I gave it a quick pass but haven't looked at the scalar operations yet

                Validity::from(BitBuffer::from(nulls.inner().clone()))
            }
        })
        .unwrap_or(Validity::NonNullable);

@connortsui20 connortsui20 Mar 26, 2026


Is it guaranteed on the arrow side that None implies non-nullable? I feel like I remember running into some oddness around this before...

Maybe worth a comment if this is totally fine

Edit: Actually looking down below it seems like both Validity::NonNullable | Validity::AllValid => None so I'm not actually sure what to do here?

Edit again: We have this code in vortex-array/src/arrow/convert.rs, I think we must pass nullability into from_arrow_variant to make this correct?

fn nulls(nulls: Option<&NullBuffer>, nullable: bool) -> Validity {
    if nullable {
        nulls
            .map(|nulls| {
                if nulls.null_count() == nulls.len() {
                    Validity::AllInvalid
                } else {
                    Validity::from(BitBuffer::from(nulls.inner().clone()))
                }
            })
            .unwrap_or_else(|| Validity::AllValid)
    } else {
        assert!(nulls.map(|x| x.null_count() == 0).unwrap_or(true));
        Validity::NonNullable
    }
}
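A std-only model of the pitfall being discussed here (the types are stand-ins, not the real arrow-rs or vortex APIs): a missing null buffer only means none was allocated, which happens both for non-nullable fields and for nullable fields that contain no nulls, so the declared nullability has to be passed in alongside the buffer, as in the helper above.

```rust
// Std-only sketch: `Option<&[bool]>` stands in for arrow's
// `Option<&NullBuffer>` (true = valid). Mirrors the `nulls` helper above.
#[derive(Debug, PartialEq)]
enum Validity {
    NonNullable,
    AllValid,
    AllInvalid,
    Bits(Vec<bool>),
}

fn nulls_to_validity(nulls: Option<&[bool]>, nullable: bool) -> Validity {
    if nullable {
        match nulls {
            // All-false validity bits: every row is null.
            Some(bits) if bits.iter().all(|b| !*b) => Validity::AllInvalid,
            Some(bits) => Validity::Bits(bits.to_vec()),
            // No buffer on a nullable field just means "no nulls present".
            None => Validity::AllValid,
        }
    } else {
        // A non-nullable field must not carry any nulls.
        assert!(nulls.map_or(true, |bits| bits.iter().all(|b| *b)));
        Validity::NonNullable
    }
}
```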

Comment on lines +164 to +165
let pv =
    ParquetVariantArray::try_new_with_validity(validity, metadata, value, typed_value)?;
Contributor


Yeah, I think it makes sense to just have try_new take validity. We could also consider a new_unchecked that does no validation here, since we trust that the arrow array is well-formed (lengths are correct).

Comment on lines +172 to +185
let nulls = match &self.validity {
    Validity::NonNullable | Validity::AllValid => None,
    Validity::AllInvalid => Some(NullBuffer::new_null(len)),
    Validity::Array(array) => {
        let mask = array.clone().execute::<Mask>(ctx)?;
        match mask {
            Mask::AllTrue(_) => None,
            Mask::AllFalse(l) => Some(NullBuffer::new_null(l)),
            Mask::Values(values) => Some(NullBuffer::from(
                arrow_buffer::BooleanBuffer::from(values.bit_buffer().clone()),
            )),
        }
    }
};
Contributor


We have use vortex::array::arrow::executor::validity::to_arrow_null_buffer; for this, which is pub(super) right now but arguably should be pub.

Contributor Author


will take a look

Comment on lines +48 to +49
ParquetVariantArray::try_new_with_validity(validity, metadata, value, typed_value)?
    .into_array(),
Contributor


Do we care about an unchecked constructor? If we're just validating lengths it's probably not worth it...

Comment on lines +131 to +145
match (&array.value, &other.value) {
    (Some(a), Some(b)) => {
        if !a.array_eq(b, precision) {
            return false;
        }
    }
    (None, None) => {}
    _ => return false,
}
match (&array.typed_value, &other.typed_value) {
    (Some(a), Some(b)) => a.array_eq(b, precision),
    (None, None) => true,
    _ => false,
}
}
Contributor


Do we want to do array equality via structural equality or value equality? Since we can have different representations for the same data we might want to just do a loop of scalar_at in here (after the validity and metadata checks).
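A sketch of the value-equality alternative (std-only; `scalar_at` is modeled here as a plain closure, not the real vortex accessor): compare element by element through the logical scalar view, so two arrays with different physical value/typed_value layouts can still compare equal.

```rust
// Std-only sketch: compare two arrays by their logical scalars rather than
// by structural equality of their value/typed_value children.
fn value_eq<S: PartialEq>(
    len: usize,
    scalar_at_a: impl Fn(usize) -> S, // stand-in for scalar_at on array a
    scalar_at_b: impl Fn(usize) -> S, // stand-in for scalar_at on array b
) -> bool {
    (0..len).all(|i| scalar_at_a(i) == scalar_at_b(i))
}
```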

Comment on lines +56 to +57
pub(crate) value: Option<ArrayRef>,
pub(crate) typed_value: Option<ArrayRef>,
Contributor


Crazy idea after a quick pass over everything: is it worth having a dedicated enum that enforces that at least one of these exists? There's a lot of logic that has to check for that everywhere (including in some free-standing functions). Maybe that's overkill, though, and introduces more complexity than it's worth?
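One possible shape for such an enum (a std-only sketch with a placeholder `Array` type parameter, not the real vortex ArrayRef). Note that the later comments in this thread observe that arrow's implementation allows the both-missing state for shredding schemas over absent fields, which would require a fourth variant.

```rust
// Sketch: make "neither value nor typed_value" unrepresentable, so callers
// no longer have to validate the Option pair everywhere.
#[allow(dead_code)]
enum VariantValue<Array> {
    Unshredded { value: Array },
    Shredded { typed_value: Array },
    PartiallyShredded { value: Array, typed_value: Array },
}
```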

Contributor Author


Definitely a good idea


@AdamGS AdamGS Mar 27, 2026


I was re-reading the arrow implementation, and I think all four states are actually valid there. Both will be missing when you declare a shredding schema with a field that doesn't exist in the actual data.

Contributor Author


They acknowledge that it's not strict adherence to the spec; not sure what to do about this for now.

@AdamGS AdamGS force-pushed the adamg/parquet-variant branch from 338af8a to 004e388 on March 27, 2026 at 13:11
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS force-pushed the adamg/parquet-variant branch from 004e388 to 863435c on March 27, 2026 at 13:19
Signed-off-by: Adam Gutglick <adam@spiraldb.com>

Labels

changelog/feature A new feature


5 participants