Apache DataFusion SQL Query Engine

arrow big-data dataframe datafusion olap python query-engine rust sql
31 Open Issues Need Help Last updated: Sep 4, 2025

Open Issues Need Help

AI Summary: The current `DataFrame.cache()` implementation materializes the entire dataset into memory on a single node, which is inefficient and unsuitable for distributed environments like Ballista. It also eagerly executes the caching operation, deviating from Spark's lazy `cache()` semantics. The proposed solution is to represent `cache` as a logical plan, allowing for proper distributed execution and lazy evaluation.

Complexity: 4/5
enhancement good first issue help wanted

Apache DataFusion SQL Query Engine

Rust
#arrow#big-data#dataframe#datafusion#olap#python#query-engine#rust#sql

enhancement good first issue help wanted proto

enhancement help wanted

enhancement good first issue

enhancement good first issue EPIC

AI Summary: The task is to enable the `clippy::clone_on_ref_ptr` lint across the entire Apache DataFusion workspace. In Cargo, the lint itself is declared under `[workspace.lints]` in the workspace's root `Cargo.toml`, and each sub-crate inherits it by adding `[lints] workspace = true` to its own `Cargo.toml`. The work therefore involves making sure every sub-crate opts in, potentially addressing any remaining crates where the lint is not yet applied.

Complexity: 2/5
good first issue
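
For readers unfamiliar with the lint, the following is a minimal sketch of the call style that `clippy::clone_on_ref_ptr` asks for. The function name `share` is hypothetical and chosen only for illustration; it is not taken from the DataFusion code base.

```rust
use std::sync::Arc;

/// Returns a second handle to the same heap allocation.
/// `share` is a hypothetical helper used only for this illustration.
pub fn share(name: &Arc<String>) -> Arc<String> {
    // `name.clone()` would also compile, but the lint flags it because it
    // reads like a deep copy of the `String`. The explicit form below makes
    // it clear that only the reference count is incremented.
    Arc::clone(name)
}
```

Both handles point at the same `String`, so the change the lint requests is about readability rather than behavior.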

AI Summary: Update the Apache DataFusion project's workspace to use Rust 1.89. This involves following the project's documented upgrade procedure, likely including updating the `rust-toolchain` file and resolving any resulting compilation issues or dependency conflicts.

Complexity: 3/5
enhancement good first issue

AI Summary: The task involves rewriting a filter expression within the DataFusion query engine's `FilterExec` component to operate on the expression itself rather than processing the data. This optimization aims to improve performance by avoiding unnecessary data manipulation, mirroring a similar optimization already implemented in the `ParquetOpener` component. The solution requires understanding DataFusion's query execution architecture and potentially modifying the `FilterExec` code to handle expression rewriting.

Complexity: 4/5
enhancement good first issue

AI Summary: The task requires formatting the code examples within the doc comments of the Apache DataFusion project using the `rustfmt` tool with the `format_code_in_doc_comments` option enabled. This involves updating the project's CI to automatically enforce this formatting.

Complexity: 3/5
good first issue

AI Summary: Refactor the Apache DataFusion `SpillManager` to combine the similar functions `spill_record_batch_by_size_and_return_max_batch_memory()` and `spill_record_batch_by_size()`, likely by removing one and integrating its functionality into the other to reduce code duplication and improve maintainability.

Complexity: 2/5
enhancement good first issue

AI Summary: The task is to investigate and resolve an inconsistency between the `AsyncScalarUDFImpl::invoke_async_with_args` and `ScalarUDFImpl::invoke_with_args` functions in the Apache DataFusion project. Both functions should ideally return a `ColumnarValue`; the current discrepancy, in which one of them uses `ArrayRef` instead, needs to be analyzed and corrected for consistency.

Complexity: 4/5
good first issue

AI Summary: Implement the Spark SQL `last_day` datetime function within the Apache DataFusion `datafusion-spark` crate. This involves referencing existing test files and a Sail implementation, adding the function, and creating necessary tests and documentation. The task leverages existing DataFusion infrastructure and follows established contribution guidelines.

Complexity: 3/5
enhancement good first issue spark
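
As a rough illustration of the semantics involved, Spark's `last_day` returns the last day of the month that the given date belongs to. The sketch below shows only that calendar arithmetic on plain integers. The names `is_leap_year` and `last_day_of_month` are hypothetical, and this is not the `datafusion-spark` implementation, which would operate on Arrow date arrays.

```rust
/// Returns true for leap years in the proleptic Gregorian calendar.
fn is_leap_year(year: i32) -> bool {
    (year % 4 == 0 && year % 100 != 0) || year % 400 == 0
}

/// Day-of-month of the last day of `month` (1..=12) in `year`.
/// Returns `None` when `month` is out of range.
/// Hypothetical helper for illustration only.
pub fn last_day_of_month(year: i32, month: u32) -> Option<u32> {
    match month {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => Some(31),
        4 | 6 | 9 | 11 => Some(30),
        2 => Some(if is_leap_year(year) { 29 } else { 28 }),
        _ => None,
    }
}
```

For example, any date in February 2024 maps to day 29, while any date in February 2023 maps to day 28.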

AI Summary: Implement the Spark `next_day` date function within the Apache DataFusion `datafusion-spark` crate. This involves referencing existing test files and a Sail implementation, adding the function, and creating necessary tests and documentation. The task leverages existing resources and follows established patterns within the DataFusion project.

Complexity: 3/5
enhancement good first issue spark
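
To make the expected behavior concrete: Spark's `next_day` returns the first date strictly later than the start date that falls on the named day of the week. The sketch below covers only the day-offset arithmetic, assuming weekdays are encoded as 0..=6. The encoding and the name `days_until_next` are assumptions made for illustration, not part of the `datafusion-spark` crate.

```rust
/// Number of days to add to a date falling on weekday `current` to reach
/// the next occurrence of weekday `target`.
/// Weekdays are assumed to be encoded as 0..=6 (the starting day of the
/// week does not matter as long as both arguments use the same encoding).
/// The result is always in 1..=7: if the start date already falls on the
/// target weekday, the answer is one full week later.
pub fn days_until_next(current: u32, target: u32) -> u32 {
    let diff = (target % 7 + 7 - current % 7) % 7;
    if diff == 0 {
        7
    } else {
        diff
    }
}
```

Adding the returned offset to the start date gives the result date. The offset is never zero, which matches the "strictly later" rule.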

AI Summary: The task involves replacing several hardcoded floating-point constants related to π in the Apache DataFusion project with values computed by the `next_up` and `next_down` methods, which were stabilized in Rust 1.86. This requires updating relevant functions in `datafusion/common/src/scalar/mod.rs`, verifying the updated values against existing constants, and documenting any precision differences.

Complexity: 2/5
good first issue
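
A minimal sketch of what the replacement could look like, assuming a toolchain of Rust 1.86 or newer. The function names `pi_upper_bound` and `pi_lower_bound` are hypothetical. Whether the values they produce coincide with the constants currently hardcoded in DataFusion is exactly what the issue asks to verify, so no claim is made about that here.

```rust
use std::f64::consts::PI;

/// Smallest `f64` strictly greater than the `f64` approximation of π.
/// Hypothetical name, for illustration only.
pub fn pi_upper_bound() -> f64 {
    PI.next_up()
}

/// Largest `f64` strictly less than the `f64` approximation of π.
/// Hypothetical name, for illustration only.
pub fn pi_lower_bound() -> f64 {
    PI.next_down()
}
```

Because `PI` is a positive finite value, each neighbour differs from it by exactly one step in the bit representation. That gives a straightforward way to compare the computed values against an existing constant.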

AI Summary: Enhance the display of the `OutputRequirementExec` in Apache DataFusion's query execution plan visualization to include the `order_requirement` and `dist_requirement` information, providing more detailed insights into the execution plan.

Complexity: 3/5
good first issue

AI Summary: Update the Apache DataFusion project's workspace to use Rust 1.88. This involves following the project's documented upgrade procedure, likely including updating the `rust-toolchain` file and resolving any resulting compilation issues or dependency conflicts.

Complexity: 3/5
enhancement good first issue development-process

AI Summary: The task is to modify the Apache DataFusion query engine to prevent users from simultaneously specifying both `order_by` and `within_group` clauses in SQL queries. This likely involves updating the query parser and planner to detect and reject such invalid combinations.

Complexity: 4/5
bug good first issue

AI Summary: Debug and fix a bug in Apache DataFusion's query optimizer where filter pushdown optimization is failing to remove a `SortExec` node when it should, resulting in suboptimal query execution plans. The issue occurs when `datafusion.execution.parquet.pushdown_filters` is set to true and involves investigating the equivalence information handling within the optimizer's filter pushdown rules.

Complexity: 4/5
bug help wanted

AI Summary: Create a script that automatically detects breaking API changes in the Apache DataFusion project by analyzing commit history and comparing against the project's API health guidelines. The script should identify changes that meet the criteria for breaking changes as defined in the documentation and flag them appropriately, ideally adding a 'breaking change' label to relevant issues or pull requests.

Complexity: 4/5
help wanted development-process

AI Summary: Document the `DESCRIBE` (and `DESC`) command within the Apache DataFusion project's `ddl.md` file, including a description and usage examples.

Complexity: 2/5
documentation good first issue

AI Summary: Refactor the `StreamJoinMetrics` component of the Apache DataFusion project to reuse the existing `BaselineMetrics` component, improving code consistency and maintainability. This involves modifying the code to leverage the functionality already present in `BaselineMetrics` instead of duplicating it in `StreamJoinMetrics`.

Complexity: 4/5
enhancement good first issue

AI Summary: Refactor the `BuildProbeJoinMetrics` struct used by the Apache DataFusion hash join implementation to reuse the existing `BaselineMetrics` structure. This is a code refactoring task aimed at reducing redundancy and improving the overall design of the metrics collection in the hash join implementation.

Complexity: 4/5
enhancement good first issue

AI Summary: Refactor the `UnnestMetrics` code in the Apache DataFusion project to reuse the existing `BaselineMetrics` code, improving code maintainability and reducing redundancy. This is a code refactoring task aimed at improving the efficiency and structure of the project's metrics handling.

Complexity: 4/5
enhancement good first issue

AI Summary: Refactor the `SortMergeJoinMetrics` struct in the Apache DataFusion project to embed or otherwise reuse the existing `BaselineMetrics` struct (Rust structs do not support inheritance), thereby ensuring consistency and reducing redundancy in metric tracking for the Sort-Merge Join operator. This involves modifying the `SortMergeJoinMetrics` definition to include `BaselineMetrics` and potentially adjusting related code to reflect the change.

Complexity: 3/5
enhancement good first issue
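
The metrics refactors listed above share one idea: an operator-specific metrics struct holds the common metrics as a field instead of duplicating them. The sketch below illustrates that composition pattern with stand-in types. `BaselineMetricsSketch` and `SortMergeJoinMetricsSketch` are hypothetical and deliberately do not mirror the fields or methods of DataFusion's real `BaselineMetrics`.

```rust
/// Stand-in for the shared metrics; NOT DataFusion's real `BaselineMetrics`.
#[derive(Debug, Default)]
pub struct BaselineMetricsSketch {
    pub output_rows: usize,
    pub elapsed_compute_nanos: u64,
}

impl BaselineMetricsSketch {
    /// Records that `rows` more rows were produced.
    pub fn record_output(&mut self, rows: usize) {
        self.output_rows += rows;
    }
}

/// Hypothetical operator metrics that embed the shared part
/// and add only what is specific to the operator.
#[derive(Debug, Default)]
pub struct SortMergeJoinMetricsSketch {
    pub baseline: BaselineMetricsSketch,
    pub input_batches: usize,
}

impl SortMergeJoinMetricsSketch {
    /// Counts one processed batch and forwards the output row count
    /// to the embedded baseline metrics.
    pub fn record_batch(&mut self, rows_out: usize) {
        self.input_batches += 1;
        self.baseline.record_output(rows_out);
    }
}
```

With this layout, common counters such as output rows live in one place, and each operator adds only its own fields.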

AI Summary: The task is to enhance the `datafusion-cli` to correctly handle Parquet files located within folders. Currently, it fails to recognize a directory containing Parquet files and throws an error. The solution involves modifying the CLI's file parsing logic to identify and interpret Parquet files within a given directory, effectively treating the directory as a wildcard expression (e.g., `/tmp/t1/*.parquet`). This requires updating the way the CLI handles file paths and potentially adding support for glob patterns.

Complexity: 4/5
good first issue

AI Summary: The task is to enhance Apache DataFusion's query engine to support projecting columns (like `window_start`, `window_end`, `window_duration`) that are generated during windowing operations but don't exist in the original table schema. The current implementation only allows projecting aggregated values, not window metadata. A solution needs to be found that avoids unnatural workarounds like dummy aggregate UDFs with arguments.

Complexity: 4/5
help wanted
