Schema merge failed since switch to Datafusion if a field is a list of structs #3339

Open
liamphmurphy opened this issue Mar 20, 2025 · 4 comments
Labels
bug Something isn't working on-hold Issues and Pull Requests that are on hold for some reason

Comments

@liamphmurphy
Contributor

liamphmurphy commented Mar 20, 2025

Environment

Delta-rs version: v0.25.4 (see below for specifics)

Binding: Python, rust engine

Environment: Local, S3


Bug

What happened:
Since the adoption of DataFusion, delta-rs appears to struggle with schema merges if the originating table schema contains a list of structs (a PyArrow list type, to be exact).

What you expected to happen:

Adding a non-list field to a schema that contains a list-of-structs field should merge successfully, as it did previously.

How to reproduce it:

On v0.25.4, run the following Python code:

import pyarrow as pa
from deltalake import write_deltalake

# Define the path for the Delta table
delta_table_path = "./datafusion-repro-test-table"

# Define the data for the first write
data_first_write = [
    {
        "uid": "ws_2",
        "event": {
            "properties": {
                "fields": [
                    {
                        "messageId": "veniam sed et elit adipisicing"
                    }
                ],
            },
        }
    }
]

schema = pa.schema([
    pa.field("uid", pa.string()),
    pa.field("event", pa.struct([
        pa.field("properties", pa.struct([
            pa.field("fields", pa.list_(pa.struct([
                pa.field("messageId", pa.string()),
            ]))),
        ])),
    ])),
])

print(schema)



first_write = pa.Table.from_pylist(data_first_write, schema=schema)

# Write data to Delta table for the first write
write_deltalake(delta_table_path, first_write, mode="append", engine="rust", schema_mode="merge")

#### NOW FOR THE SECOND WRITE THAT BREAKS ####

data_second_write = [
    {
        "uid": "ws_2",
        "event": {
            "properties": {
                "someNewField": "test-value", # New field
                "fields": [
                    {
                        "messageId": "veniam sed et elit adipisicing"
                    }
                ],
            },
        }
    }
]

second_schema = pa.schema([
    pa.field("uid", pa.string()),
    pa.field("event", pa.struct([
        pa.field("properties", pa.struct([
            pa.field("someNewField", pa.string()), # New field
            pa.field("fields", pa.list_(pa.struct([
                pa.field("messageId", pa.string()),
            ]))),
        ])),
    ])),
])

second_write = pa.Table.from_pylist(data_second_write, schema=second_schema)

# Write data to Delta table for the second write
write_deltalake(delta_table_path, second_write, mode="append", engine="rust", schema_mode="merge")

More details:

The above code works as expected on the last version I was using, v0.19.2.

@liamphmurphy liamphmurphy added the bug Something isn't working label Mar 20, 2025
@liamphmurphy
Contributor Author

liamphmurphy commented Mar 20, 2025

^ To clarify, if the schema does not contain a list field, the merge works as expected.

EDIT: If I switch the fields list to a list of strings instead of a list of structs, it works. So this seems to be a specific issue with a list of structs.

@liamphmurphy liamphmurphy changed the title Schema merge failed since switch to Datafusion if a field is a list Schema merge failed since switch to Datafusion if a field is a list of structs Mar 20, 2025
@ion-elgreco
Collaborator

@liamphmurphy Yeah, this is not something we can fix in delta-rs; it's an unsupported cast in DataFusion. Can you please open an issue upstream at https://github.com/apache/datafusion?

cc @alamb

Error message:

This feature is not implemented: Unsupported CAST from Struct([Field { name: "properties", data_type: Struct([Field { name: "someNewField", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "fields", data_type: List(Field { name: "item", data_type: Struct([Field { name: "messageId", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) to Struct([Field { name: "properties", data_type: Struct([Field { name: "fields", data_type: List(Field { name: "element", data_type: Struct([Field { name: "messageId", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "someNewField", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])
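
The message is hard to read on one line, so here is a small sketch that transcribes the two types by hand and lists where they differ. The nested-tuple encoding and the `diff_types` helper are assumptions made for this illustration; they are not DataFusion or delta-rs APIs. Nullability and metadata are identical on both sides of the message and are left out.

```python
# A struct is a list of (name, type) pairs so that field order is preserved.
# A list type is ("list", item_field_name, item_type).
MESSAGE_ID = [("messageId", "Utf8")]

# Source type of the cast (the incoming data), transcribed from the error.
cast_from = [("properties", [
    ("someNewField", "Utf8"),
    ("fields", ("list", "item", MESSAGE_ID)),
])]

# Target type of the cast (the merged table schema), transcribed from the error.
cast_to = [("properties", [
    ("fields", ("list", "element", MESSAGE_ID)),
    ("someNewField", "Utf8"),
])]


def diff_types(a, b, path=""):
    """Return human-readable differences between two transcribed types."""
    if isinstance(a, list) and isinstance(b, list):  # both are structs
        out = []
        names_a = [name for name, _ in a]
        names_b = [name for name, _ in b]
        if names_a != names_b:
            out.append(f"{path or '<root>'}: field order {names_a} vs {names_b}")
        b_by_name = dict(b)
        for name, typ in a:
            if name in b_by_name:
                child = f"{path}.{name}".lstrip(".")
                out += diff_types(typ, b_by_name[name], child)
        return out
    if isinstance(a, tuple) and isinstance(b, tuple):  # both are lists
        out = []
        if a[1] != b[1]:
            out.append(f"{path}: list item name {a[1]!r} vs {b[1]!r}")
        return out + diff_types(a[2], b[2], f"{path}[]")
    return [] if a == b else [f"{path}: {a} vs {b}"]


differences = diff_types(cast_from, cast_to)
for line in differences:
    print(line)
```

Under this transcription the two types differ in exactly two places: the order of the fields inside `properties`, and the name of the list item field (`item` on the source side, `element` on the target side). Which of the two triggers the unsupported cast is not established in this thread.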

@liamphmurphy
Contributor Author

Bug issue report upstream: apache/datafusion#15338

@ynikolai

The issue started with the python-v0.25.0 release.
delta-rs (or DataFusion, I don't know which) fails to add a new subfield to a struct field at any position except the last.
The problem is not the cast itself, but that the data is written into the wrong columns.
Delta appends the new column to the end of the struct regardless of its position in the provided schema (this is the documented behavior), but then tries to write the data columns in the order they appear in the provided schema.
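
The mismatch described above can be illustrated in plain Python. This is a toy model of the described behavior, not the actual delta-rs or DataFusion code path; the field names `c2`, `c3`, and `new_field` are taken from the repro in this comment.

```python
# Field order of struct c1 after the merge: the new field is appended last.
table_order = ["c2", "c3", "new_field"]

# Field order and values of c1 in the incoming data, as in the provided schema.
incoming_order = ["c2", "new_field", "c3"]
incoming_values = ["v2", None, "v3"]

# Positional mapping: the i-th incoming value lands in the i-th table field.
by_position = dict(zip(table_order, incoming_values))

# Name-based mapping: each value follows its own field name.
incoming_by_name = dict(zip(incoming_order, incoming_values))
by_name = {name: incoming_by_name.get(name) for name in table_order}

print(by_position)  # {'c2': 'v2', 'c3': None, 'new_field': 'v3'}
print(by_name)      # {'c2': 'v2', 'c3': 'v3', 'new_field': None}
```

With positional mapping, the string 'v3' ends up in `new_field`, which is consistent with a cast error when `new_field` is a boolean or integer type.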

Here is minimal code to reproduce. The new column can be added in the first position or in the middle; it fails the same way.
Only if the new column is the last one in the struct is the data written as expected.
Adding a new string or list(string) field will "succeed" (not shown in the code), but only because other data types can be converted to string, so wrong data is silently written into an unintended column.

from deltalake import write_deltalake
import pyarrow as pa


def write_table_v1(table_path):
    schema_v1 = pa.schema([
        pa.field(
            "c1",
            pa.struct([
                pa.field("c2", pa.string()),
                pa.field("c3", pa.string()),
            ])
        )
    ])
    data = [{"c1": {"c2": "v2", "c3": "v3"}}]
    table = pa.Table.from_pylist(data, schema_v1)
    write_deltalake(
        table_or_uri=table_path,
        data=table,
        schema=schema_v1,
        mode="append",
        schema_mode="merge",
        engine="rust",
    )


def write_table_v2(table_path, new_field_type):
    schema_v2 = pa.schema([
        pa.field(
            "c1",
            pa.struct([
                pa.field("c2", pa.string()),
                pa.field("new_field", new_field_type),
                pa.field("c3", pa.string()),
            ])
        )
    ])
    data = [
        {
            "c1": {
                "c2": "v2",
                "new_field": None,
                "c3": "v3",
            }
        }
    ]
    table = pa.Table.from_pylist(data, schema_v2)
    write_deltalake(
        table_or_uri=table_path,
        data=table,
        schema=schema_v2,
        mode="append",
        schema_mode="merge",
        engine="rust",
    )


write_table_v1("table1")
for field_type in (pa.bool_(), pa.int64(), pa.list_(pa.int64())):
    try:
        write_table_v2("table1", field_type)
    except Exception as e:
        print(e)

Output

Cast error: Cannot cast value 'v3' to value of Boolean type
Cast error: Cannot cast string 'v3' to value of Int64 type
Cast error: Cannot cast string 'v3' to value of Int64 type
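
Since writes are reported to succeed only when the new field is last in the struct, one possible stopgap is to order the struct fields of the provided schema so that existing fields keep the table's order and new fields follow. The helper below is a hypothetical sketch that computes such an order from field names only; it has not been verified against v0.25.x as a workaround.

```python
def merge_friendly_order(existing_names, incoming_names):
    """Existing fields in table order, then new fields in incoming order.

    Hypothetical helper for illustration; not part of deltalake or pyarrow.
    """
    known = [name for name in existing_names if name in incoming_names]
    new = [name for name in incoming_names if name not in existing_names]
    return known + new


order = merge_friendly_order(["c2", "c3"], ["c2", "new_field", "c3"])
print(order)  # ['c2', 'c3', 'new_field']
```

The resulting list could then be used to build the `pa.struct` field list and the row dictionaries for the second write in that order.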

@ion-elgreco ion-elgreco added the on-hold Issues and Pull Requests that are on hold for some reason label Mar 21, 2025