NanoNets/docext

An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

1.7K stars, 130 forks, 1.7K watchers, Python, Apache License 2.0

document document-analysis document-data-extraction document-information-extraction extraction llm-ocr llms machine-learning nlp ocr ocr-benchmark ocr-onpremise onprem onprem-ocr onprem-vision onpremise rag table-extraction unstructured-data vlms

3 Open Issues Need Help Last updated: Sep 13, 2025

Open Issues Need Help

Benchmark files missing when installed (9 months ago)

good first issue

Image extraction from PDFs can reduce resolution and alter aspect ratio (10 months ago)

enhancement, good first issue

Can multiple models be started at the same time? (12 months ago)

AI Summary: Determine whether the docext toolkit can run multiple models concurrently, specifically Nanonets-OCR-s for PDF-to-Markdown conversion and Qwen2.5-VL-7B-Instruct-AWQ for information extraction. This involves reviewing the docext documentation and possibly testing concurrent model execution.

Complexity: 4/5

enhancement, help wanted
