Synthetic Text Dataset Generation for LLM projects

3 Open Issues Need Help (last updated: Jun 17, 2025)



AI Summary: Modify the `push_to_hub` method of the `ClassificationDataset` class in the `datafast` Python library so that uploaded datasets default to private rather than public visibility. This involves locating the parameter that controls visibility and changing its default value.

Complexity: 3/5
Labels: bug, good first issue
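
The requested change can be illustrated with a minimal sketch. Everything here is hypothetical: `ClassificationDatasetSketch`, the injected `uploader` callable, and `fake_uploader` are stand-ins so the example runs without network access, and they are not datafast's actual code. The only point is the `private=True` default in the method signature.

```python
from typing import Any, Callable, Dict, List


class ClassificationDatasetSketch:
    """Hypothetical stand-in for a dataset class; not datafast's real class."""

    def __init__(self, rows: List[Dict[str, Any]], uploader: Callable[..., str]):
        self.rows = rows
        # Assumption: the upload step is injected so the sketch stays offline.
        self._uploader = uploader

    def push_to_hub(self, repo_id: str, private: bool = True) -> str:
        # The default is private; callers must opt in to public visibility.
        return self._uploader(repo_id=repo_id, rows=self.rows, private=private)


def fake_uploader(repo_id: str, rows: List[Dict[str, Any]], private: bool) -> str:
    """Pretend upload that only reports what would have been sent."""
    visibility = "private" if private else "public"
    return f"{repo_id} ({visibility}, {len(rows)} rows)"


dataset = ClassificationDatasetSketch(
    [{"text": "hello", "label": "greeting"}], fake_uploader
)
print(dataset.push_to_hub("user/demo"))                 # default visibility
print(dataset.push_to_hub("user/demo", private=False))  # explicit opt-in
```

In a real implementation the flag would presumably be forwarded to the underlying upload call; the Hugging Face `datasets` library's `Dataset.push_to_hub` accepts a `private` argument, though whether datafast uses that path is not stated here.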


Python

AI Summary: Experiment with different methods to improve the diversity of text generated by LLMs within the Datafast project. This involves exploring techniques like keyword extraction, counting, and injection into prompts to prevent repetition, as well as analyzing sentence structure using n-grams to encourage more varied outputs. The goal is to create Jupyter notebooks demonstrating these experimental approaches before full implementation.

Complexity: 3/5
Labels: enhancement, good first issue
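
One way such an experiment might start is sketched below, using only the standard library. The function names, the tiny stopword list, and the prompt wording are all assumptions made for illustration; none of them come from Datafast. The sketch counts keywords across already generated samples, injects the most frequent ones into the next prompt as words to avoid, and counts n-grams to surface repeated sentence openings.

```python
import re
from collections import Counter
from typing import List

# Assumption: a tiny illustrative stopword list, not a complete one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def overused_keywords(samples: List[str], top_k: int = 3) -> List[str]:
    """Return the top_k most frequent non-stopword tokens across samples."""
    counts = Counter(
        tok for s in samples for tok in tokenize(s) if tok not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(top_k)]


def ngram_counts(samples: List[str], n: int = 3) -> Counter:
    """Count word n-grams across samples to spot repeated phrasing."""
    counts: Counter = Counter()
    for s in samples:
        toks = tokenize(s)
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return counts


def build_prompt(base_prompt: str, samples: List[str], top_k: int = 3) -> str:
    """Append an instruction to avoid words that already dominate the samples."""
    avoid = overused_keywords(samples, top_k)
    if not avoid:
        return base_prompt
    return f"{base_prompt}\nAvoid these overused words: {', '.join(avoid)}."


samples = [
    "The battery life is great",
    "The battery life is terrible",
    "Great screen and great battery",
]
print(build_prompt("Write a short product review.", samples, top_k=2))
print(ngram_counts(samples, n=3).most_common(1))
```

In a notebook, the same counters could be recomputed after each generation batch to check whether the injected instruction actually reduces repetition.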


AI Summary: Implement a `.inspect()` method for Datafast's dataset objects. This method should launch a Gradio app allowing users to interactively browse and optionally edit, delete, validate, or rate generated dataset samples in a user-friendly way, improving the dataset review process before pushing to Hugging Face Hub.

Complexity: 4/5
Labels: enhancement, good first issue
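
A rough sketch of what the browsing and editing part of such an app could look like follows. It assumes samples are a non-empty list of dicts with `text` and `label` keys, which is an assumption about the data shape rather than Datafast's actual format. The helper names `show_row`, `save_row`, and `build_inspect_app` are hypothetical, and deleting, validating, and rating are left out for brevity. Gradio is imported lazily so the helper functions can be used without it installed.

```python
from typing import Dict, List, Tuple


def show_row(rows: List[Dict[str, str]], index: int) -> Tuple[int, str, str]:
    """Clamp index to a valid position and return (index, text, label)."""
    index = max(0, min(index, len(rows) - 1))
    row = rows[index]
    return index, row["text"], row["label"]


def save_row(rows: List[Dict[str, str]], index: int, text: str, label: str) -> str:
    """Overwrite one row in place with the edited values."""
    rows[index] = {"text": text, "label": label}
    return f"Saved row {index + 1} of {len(rows)}"


def build_inspect_app(rows: List[Dict[str, str]]):
    """Build (but do not launch) a small Gradio app for reviewing rows."""
    import gradio as gr  # lazy import; requires `pip install gradio`

    with gr.Blocks() as demo:
        index = gr.State(0)  # position of the row currently shown
        text = gr.Textbox(label="Text", value=rows[0]["text"])
        label = gr.Textbox(label="Label", value=rows[0]["label"])
        status = gr.Markdown()
        with gr.Row():
            prev_btn = gr.Button("Previous")
            next_btn = gr.Button("Next")
            save_btn = gr.Button("Save edit")

        # Navigation updates the stored index and both text fields.
        prev_btn.click(lambda i: show_row(rows, i - 1),
                       inputs=index, outputs=[index, text, label])
        next_btn.click(lambda i: show_row(rows, i + 1),
                       inputs=index, outputs=[index, text, label])
        # Saving writes the edited fields back into the list.
        save_btn.click(lambda i, t, l: save_row(rows, i, t, l),
                       inputs=[index, text, label], outputs=status)
    return demo


# Usage (opens a local web UI): build_inspect_app(my_rows).launch()
```

An `.inspect()` method on a dataset object could then be a thin wrapper that passes its rows to a builder like this and calls `launch()` on the result.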
