Synthetic Text Dataset Generation for LLM projects

40 stars, 5 forks, 40 watchers, Python, Apache License 2.0
5 Open Issues Need Help | Last updated: Sep 16, 2025

Open Issues Need Help

good first issue, In Progress

enhancement, good first issue

AI Summary: Modify the `push_to_hub` method of the `ClassificationDataset` class in the `datafast` Python library so that uploaded datasets default to 'private' visibility instead of 'public'. This involves locating the relevant code, identifying the parameter that controls visibility, and changing its default to private.

Complexity: 3/5
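
As a rough illustration of the change this issue asks for, the sketch below flips a visibility default to private. The class name `ClassificationDatasetSketch`, the `private` parameter, and the injected `uploader` callable are all assumptions made for the example; the real `datafast` signature may differ, and the actual fix should follow whatever parameter the library already exposes.

```python
from typing import Any, Callable


class ClassificationDatasetSketch:
    """Hypothetical stand-in for illustration; not datafast's real class."""

    def __init__(self, rows: list[dict[str, Any]], uploader: Callable[..., str]):
        self.rows = rows
        # The uploader is injected so the sketch runs without network access.
        self._uploader = uploader

    def push_to_hub(self, repo_id: str, private: bool = True) -> str:
        # Assumed parameter name. The point is the default: private unless
        # the caller explicitly asks for a public dataset.
        return self._uploader(repo_id=repo_id, rows=self.rows, private=private)


def fake_uploader(repo_id: str, rows: list[dict[str, Any]], private: bool) -> str:
    """Records what would have been uploaded instead of calling the Hub."""
    visibility = "private" if private else "public"
    return f"{repo_id}:{visibility}:{len(rows)}"


ds = ClassificationDatasetSketch([{"text": "hi", "label": "greeting"}], fake_uploader)
default_result = ds.push_to_hub("user/demo")                # private by default
public_result = ds.push_to_hub("user/demo", private=False)  # explicit opt-in
```

Making the caller opt in to public visibility means an accidental upload of unreviewed synthetic data stays unpublished.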
bug, good first issue

AI Summary: Experiment with different methods to improve the diversity of text generated by LLMs within the Datafast project. This involves exploring techniques like keyword extraction, counting, and injection into prompts to prevent repetition, as well as analyzing sentence structure using n-grams to encourage more varied outputs. The goal is to create Jupyter notebooks demonstrating these experimental approaches before full implementation.

Complexity: 3/5
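
One possible starting point for such a notebook, using only the standard library: count the most frequent keywords in already generated samples, inject them into the next prompt as words to avoid, and track a distinct n-gram ratio as a crude diversity score. The function names, the stopword list, and the prompt wording are illustrative assumptions, not part of Datafast.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real experiment would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on", "with"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_keywords(texts: list[str], k: int = 5) -> list[str]:
    """Return the k most frequent non-stopword tokens across all texts."""
    counts = Counter(
        tok for text in texts for tok in tokenize(text) if tok not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(k)]


def distinct_n(texts: list[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams; higher means more varied."""
    ngrams = []
    for text in texts:
        toks = tokenize(text)
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def add_avoid_list(prompt: str, overused: list[str]) -> str:
    """Append an instruction asking the model to avoid overused words."""
    if not overused:
        return prompt
    return f"{prompt}\nAvoid these overused words: {', '.join(overused)}."


samples = [
    "The battery life is great and the screen is great.",
    "Great battery, great price, great screen.",
]
overused = top_keywords(samples, k=1)
next_prompt = add_avoid_list("Write a short product review.", overused)
score = distinct_n(samples, n=2)
```

Comparing `distinct_n` before and after prompt injection gives a quick, if coarse, signal of whether a technique reduces repetition.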
enhancement, good first issue

AI Summary: Implement a `.inspect()` method for Datafast's dataset objects. This method should launch a Gradio app allowing users to interactively browse and optionally edit, delete, validate, or rate generated dataset samples in a user-friendly way, improving the dataset review process before pushing to Hugging Face Hub.

Complexity: 4/5
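
A minimal sketch of how such a review flow could be structured, assuming samples are plain dictionaries with a `text` field. `ReviewSession` and `build_app` are hypothetical names, not Datafast APIs. The review state is kept separate from the UI so it can be tested without Gradio; `build_app` imports Gradio lazily and wires only a small subset of the actions (browse and rate).

```python
from typing import Any


class ReviewSession:
    """Hypothetical review state that a UI could sit on top of."""

    def __init__(self, samples: list[dict[str, Any]]):
        # Copy each sample so review actions never mutate the caller's data.
        self.samples = [dict(s) for s in samples]
        self._deleted: set[int] = set()

    def edit(self, index: int, text: str) -> None:
        self.samples[index]["text"] = text

    def rate(self, index: int, score: int) -> None:
        self.samples[index]["rating"] = score

    def delete(self, index: int) -> None:
        self._deleted.add(index)

    def kept(self) -> list[dict[str, Any]]:
        """Return the samples that survived review, in original order."""
        return [s for i, s in enumerate(self.samples) if i not in self._deleted]


def build_app(session: ReviewSession):
    """Build a small Gradio UI over the session (requires `pip install gradio`)."""
    import gradio as gr  # imported lazily so the rest runs without Gradio

    with gr.Blocks() as demo:
        index = gr.Number(value=0, precision=0, label="Sample index")
        text = gr.Textbox(label="Text")
        rating = gr.Slider(1, 5, step=1, label="Rating")
        show = gr.Button("Show")
        save = gr.Button("Save rating")
        show.click(lambda i: session.samples[int(i)]["text"], inputs=index, outputs=text)
        save.click(lambda i, r: session.rate(int(i), int(r)), inputs=[index, rating], outputs=None)
    return demo


session = ReviewSession([{"text": "first sample"}, {"text": "second sample"}])
session.rate(0, 4)
session.delete(1)
reviewed = session.kept()
# build_app(session).launch()  # uncomment to open the UI locally
```

An `.inspect()` method on a dataset object could then create a session from its rows, launch the app, and keep only `session.kept()` before pushing to the Hugging Face Hub.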
enhancement, good first issue
