An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

document, document-analysis, document-data-extraction, document-information-extraction, extraction, llm-ocr, llms, machine-learning, nlp, ocr, ocr-benchmark, ocr-onpremise, onprem, onprem-ocr, onprem-vision, onpremise, rag, table-extraction, unstructured-data, vlms
3 open issues need help. Last updated: Sep 13, 2025

good first issue

Python


AI Summary: Determine whether the docext toolkit can run multiple models concurrently, specifically Nanonets-OCR-s for PDF-to-Markdown conversion and Qwen2.5-VL-7B-Instruct-AWQ for information extraction. This involves reviewing the docext documentation and, if the documentation is inconclusive, testing concurrent model execution.

Complexity: 4/5
enhancement help wanted
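
For the testing part of this issue, a small harness along the following lines might help. It is only a sketch: `run_concurrently`, `convert_pdf_to_markdown`, and `extract_fields` are hypothetical names and are not part of docext's API. The two stand-in functions would need to be replaced with real calls to wherever each model is served. A barrier forces both tasks to start before either proceeds, so the run succeeds only if the two calls genuinely overlap.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(tasks, timeout=5.0):
    """Run named callables in parallel threads and return {name: result}.

    Every task waits at a barrier until all tasks have started, so the
    call raises BrokenBarrierError if the tasks cannot overlap in time.
    """
    barrier = threading.Barrier(len(tasks))

    def wrapped(fn):
        barrier.wait(timeout=timeout)  # blocks until every task has started
        return fn()

    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(wrapped, fn) for name, fn in tasks.items()}
        return {name: fut.result() for name, fut in futures.items()}


# Hypothetical stand-ins (assumptions, not docext functions): swap in the
# real PDF-to-Markdown and information-extraction calls when testing.
def convert_pdf_to_markdown():
    return "# placeholder markdown"


def extract_fields():
    return {"invoice_number": "placeholder"}


results = run_concurrently({
    "Nanonets-OCR-s": convert_pdf_to_markdown,
    "Qwen2.5-VL-7B-Instruct-AWQ": extract_fields,
})
```

This checks only that the two calls can be in flight at the same time. Whether both models fit in GPU memory together is a separate question that the harness does not answer.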
