Open Issues Need Help
AI Summary: The current `DataFrame.cache()` implementation materializes the entire dataset into memory on a single node, which is inefficient and unsuitable for distributed environments like Ballista. It also eagerly executes the caching operation, deviating from Spark's lazy `cache()` semantics. The proposed solution is to represent `cache` as a logical plan, allowing for proper distributed execution and lazy evaluation.
Apache DataFusion SQL Query Engine
AI Summary: The task is to enable the `clippy::clone_on_ref_ptr` lint across the entire Apache DataFusion workspace. This involves adding `[lints] workspace = true` to each sub-crate's `Cargo.toml` so that every crate inherits the workspace lint settings, and addressing any remaining crates where the lint is not yet applied.
Apache DataFusion SQL Query Engine
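For readers unfamiliar with `clippy::clone_on_ref_ptr`, here is a minimal sketch of the pattern the lint asks for. The `share` helper is hypothetical and is not taken from the DataFusion code base; it only illustrates the explicit `Arc::clone` form.

```rust
use std::sync::Arc;

/// Returns a second handle to the same shared allocation.
///
/// With `clippy::clone_on_ref_ptr` enabled, writing `shared.clone()` here
/// would be flagged. The explicit `Arc::clone(shared)` makes it obvious that
/// only the reference count is incremented and the vector is not copied.
fn share(shared: &Arc<Vec<i32>>) -> Arc<Vec<i32>> {
    Arc::clone(shared)
}
```

Both handles point at the same allocation, so `Arc::ptr_eq` on the original and the returned value holds.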
AI Summary: Update the Apache DataFusion project's workspace to use Rust 1.89. This involves following the project's documented upgrade procedure, likely including updating the `rust-toolchain` file and resolving any resulting compilation issues or dependency conflicts.
Apache DataFusion SQL Query Engine
AI Summary: The task involves rewriting a filter expression within the DataFusion query engine's `FilterExec` component to operate on the expression itself rather than processing the data. This optimization aims to improve performance by avoiding unnecessary data manipulation, mirroring a similar optimization already implemented in the `ParquetOpener` component. The solution requires understanding DataFusion's query execution architecture and potentially modifying the `FilterExec` code to handle expression rewriting.
Apache DataFusion SQL Query Engine
AI Summary: The task requires formatting the code examples within the docstrings of the Apache DataFusion project using the `rustfmt` tool with the `format_code_in_doc_comments` option enabled. This involves updating the project's CI to automatically enforce this formatting.
Apache DataFusion SQL Query Engine
AI Summary: Refactor the Apache DataFusion `SpillManager` to combine the similar functions `spill_record_batch_by_size_and_return_max_batch_memory()` and `spill_record_batch_by_size()`, likely by removing one and integrating its functionality into the other to reduce code duplication and improve maintainability.
Apache DataFusion SQL Query Engine
AI Summary: The task is to investigate and resolve an inconsistency between the `AsyncScalarUDFImpl::invoke_async_with_args` and `ScalarUDFImpl::invoke_with_args` functions in the Apache DataFusion project. Both functions should ideally return a `ColumnarValue`, and the current discrepancy using `ArrayRef` in one needs to be analyzed and corrected for consistency.
Apache DataFusion SQL Query Engine
AI Summary: Implement the Spark SQL `last_day` datetime function within the Apache DataFusion `datafusion-spark` crate. This involves referencing existing test files and a Sail implementation, adding the function, and creating necessary tests and documentation. The task leverages existing DataFusion infrastructure and follows established contribution guidelines.
Apache DataFusion SQL Query Engine
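The core calendar logic behind a `last_day` function can be sketched with the standard library alone. This is not the `datafusion-spark` or Sail implementation; the function names are hypothetical, and a real implementation would operate on Arrow date arrays rather than on a bare year and month.

```rust
/// Returns true for leap years in the proleptic Gregorian calendar.
fn is_leap_year(year: i32) -> bool {
    (year % 4 == 0 && year % 100 != 0) || year % 400 == 0
}

/// Returns the day-of-month of the last day of `month` (1..=12) in `year`,
/// or `None` when `month` is out of range.
fn last_day_of_month(year: i32, month: u32) -> Option<u32> {
    match month {
        1 | 3 | 5 | 7 | 8 | 10 | 12 => Some(31),
        4 | 6 | 9 | 11 => Some(30),
        2 => Some(if is_leap_year(year) { 29 } else { 28 }),
        _ => None,
    }
}
```

Returning `Option` keeps the out-of-range case explicit instead of panicking, which is one possible design choice among several.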
AI Summary: Implement the Spark `next_day` date function within the Apache DataFusion `datafusion-spark` crate. This involves referencing existing test files and a Sail implementation, adding the function, and creating necessary tests and documentation. The task leverages existing resources and follows established patterns within the DataFusion project.
Apache DataFusion SQL Query Engine
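Spark's `next_day` returns the first date strictly after the input date that falls on the requested weekday. The offset arithmetic can be sketched as below. The helper name and the weekday numbering (0 for Monday through 6 for Sunday) are assumptions made for illustration, not part of the DataFusion or Sail code.

```rust
/// Number of days to add to a date falling on weekday `current` to reach the
/// first strictly later date falling on weekday `target`.
///
/// Weekdays are numbered 0 (Monday) through 6 (Sunday) in this sketch.
/// When both weekdays are equal the result is 7, because the returned date
/// must be later than the input date.
fn days_until_next(current: u8, target: u8) -> u8 {
    debug_assert!(current < 7 && target < 7);
    let delta = (target + 7 - current) % 7;
    if delta == 0 { 7 } else { delta }
}
```

Adding 7 before subtracting keeps the unsigned arithmetic from underflowing when `target` is smaller than `current`.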
AI Summary: The task involves replacing several hardcoded floating-point constants related to π in the Apache DataFusion project with values computed using the `next_up` and `next_down` methods stabilized in Rust 1.86. This requires updating relevant functions in `datafusion/common/src/scalar/mod.rs`, verifying the computed values against the existing constants, and documenting any precision differences.
Apache DataFusion SQL Query Engine
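As a rough illustration of the π constants task, the sketch below derives the nearest representable `f64` neighbors of π with `next_down` and `next_up`. The function names are hypothetical, and the actual constants in `datafusion/common/src/scalar/mod.rs` are not reproduced here. It assumes a toolchain of Rust 1.86 or later.

```rust
use std::f64::consts::PI;

/// Largest `f64` strictly less than the `f64` approximation of π.
fn pi_lower() -> f64 {
    PI.next_down()
}

/// Smallest `f64` strictly greater than the `f64` approximation of π.
fn pi_upper() -> f64 {
    PI.next_up()
}
```

Each result differs from `PI` by exactly one unit in the last place, so comparing bit patterns with `to_bits` is one way to check a computed value against an existing hardcoded constant.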
AI Summary: Enhance the display of the `OutputRequirementExec` in Apache DataFusion's query execution plan visualization to include the `order_requirement` and `dist_requirement` information, providing more detailed insights into the execution plan.
Apache DataFusion SQL Query Engine
AI Summary: Update the Apache DataFusion project's workspace to use Rust 1.88. This involves following the project's documented upgrade procedure, likely including updating the `rust-toolchain` file and resolving any resulting compilation issues or dependency conflicts.
Apache DataFusion SQL Query Engine
AI Summary: The task is to modify the Apache DataFusion query engine to prevent users from simultaneously specifying both `order_by` and `within_group` clauses in SQL queries. This likely involves updating the query parser and planner to detect and reject such invalid combinations.
Apache DataFusion SQL Query Engine
AI Summary: Debug and fix a bug in Apache DataFusion's query optimizer where filter pushdown optimization is failing to remove a `SortExec` node when it should, resulting in suboptimal query execution plans. The issue occurs when `datafusion.execution.parquet.pushdown_filters` is set to true and involves investigating the equivalence information handling within the optimizer's filter pushdown rules.
Apache DataFusion SQL Query Engine
AI Summary: Create a script that automatically detects breaking API changes in the Apache DataFusion project by analyzing commit history and comparing against the project's API health guidelines. The script should identify changes that meet the criteria for breaking changes as defined in the documentation and flag them appropriately, ideally adding a 'breaking change' label to relevant issues or pull requests.
Apache DataFusion SQL Query Engine
AI Summary: Document the `DESCRIBE` (and `DESC`) command within the Apache DataFusion project's `ddl.md` file, including a description and usage examples.
Apache DataFusion SQL Query Engine
AI Summary: Refactor the `StreamJoinMetrics` component of the Apache DataFusion project to reuse the existing `BaselineMetrics` component, improving code consistency and maintainability. This involves modifying the code to leverage the functionality already present in `BaselineMetrics` instead of duplicating it in `StreamJoinMetrics`.
Apache DataFusion SQL Query Engine
AI Summary: Refactor the `BuildProbeJoinMetrics` struct used by the Apache DataFusion hash join implementation to reuse the existing `BaselineMetrics` structure, improving code efficiency and maintainability. This is a code refactoring task aimed at reducing redundancy and improving the overall design of the metrics collection in the hash join implementation.
Apache DataFusion SQL Query Engine
AI Summary: Refactor the `UnnestMetrics` code in the Apache DataFusion project to reuse the existing `BaselineMetrics` code, improving code maintainability and reducing redundancy. This is a code refactoring task aimed at improving the efficiency and structure of the project's metrics handling.
Apache DataFusion SQL Query Engine
AI Summary: Refactor the `SortMergeJoinMetrics` struct in the Apache DataFusion project to embed the existing `BaselineMetrics` struct (Rust has no struct inheritance), thereby ensuring consistency and reducing redundancy in metric tracking for the Sort-Merge Join operator. This involves modifying the `SortMergeJoinMetrics` definition to include `BaselineMetrics` and potentially adjusting related code to reflect the change.
Apache DataFusion SQL Query Engine
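The metrics refactors listed above share one idea: an operator-specific metrics struct holds a shared baseline as a field instead of duplicating its counters. The sketch below shows that composition pattern with stand-in types. `Baseline` and `JoinMetrics` are hypothetical and do not mirror the real `BaselineMetrics` API.

```rust
/// Hypothetical stand-in for a shared baseline metrics type.
/// The real `BaselineMetrics` in DataFusion has a different API.
#[derive(Debug, Default)]
struct Baseline {
    output_rows: usize,
}

impl Baseline {
    /// Adds `rows` to the running output row count.
    fn record_output(&mut self, rows: usize) {
        self.output_rows += rows;
    }
}

/// Operator-specific metrics that embed the shared baseline as a field
/// and only add the counters that are unique to the operator.
#[derive(Debug, Default)]
struct JoinMetrics {
    baseline: Baseline,
    input_batches: usize,
}

impl JoinMetrics {
    /// Records one processed input batch and the rows it produced.
    fn record_batch(&mut self, output_rows: usize) {
        self.input_batches += 1;
        self.baseline.record_output(output_rows);
    }
}
```

With this layout, common counters are updated through a single code path, and each operator keeps only its own extra fields.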
AI Summary: The task is to enhance the `datafusion-cli` to correctly handle Parquet files located within folders. Currently, it fails to recognize a directory containing Parquet files and throws an error. The solution involves modifying the CLI's file parsing logic to identify and interpret Parquet files within a given directory, effectively treating the directory as a wildcard expression (e.g., `/tmp/t1/*.parquet`). This requires updating the way the CLI handles file paths and potentially adding support for glob patterns.
Apache DataFusion SQL Query Engine
AI Summary: The task is to enhance Apache DataFusion's query engine to support projecting columns (like `window_start`, `window_end`, `window_duration`) that are generated during windowing operations but don't exist in the original table schema. The current implementation only allows projecting aggregated values, not window metadata. A solution needs to be found that avoids unnatural workarounds like dummy aggregate UDFs with arguments.
Apache DataFusion SQL Query Engine