A bit like Gitingest, which I told you about last time and which turns a Git repository into a version that LLMs (AI) can digest, today I would like to introduce Docling.
The concept is almost identical, except that it takes formats such as PDF, Word, PowerPoint, Excel, images, HTML, AsciiDoc, Markdown… and converts them to HTML, Markdown or JSON, depending on your needs. Best of all, it even preserves images, whether embedded or referenced.
Now, what makes Docling special is its ability to intelligently analyze the structure of documents. Take a PDF, for example: instead of spitting out a raw block of text with no rhyme or reason, Docling automatically detects:
- The layout and the reading order
- The structure of the tables
- Titles and subtitles
- Metadata (authors, references, language…)
- Distinct elements such as headers and footers
And if you develop AI-based applications, Docling integrates smoothly with popular frameworks such as LangChain, LlamaIndex, Crew AI and Haystack. No need to tinker for hours to connect your tools! There are many concrete integration examples in the official documentation.
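To give an idea of what these frameworks end up consuming, here is a minimal sketch that cuts a Markdown string into one chunk per heading, which is roughly the shape of input a retrieval pipeline works with. This is not one of the official integrations: `split_by_headings` is a hypothetical helper written for this illustration, and it only assumes that you already have Markdown text such as the one Docling can export.

```python
import re

# Matches ATX-style Markdown headings such as "# Title" or "### Sub-section".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown):
    """Split Markdown text into (heading, body) chunks, one per heading.

    Text that appears before the first heading is returned with an empty
    heading. Simplification: a line starting with '#' inside a fenced code
    block would also be treated as a heading.
    """
    chunks = []
    heading, body = None, []

    def flush():
        # Store the chunk collected so far, skipping a purely empty preamble.
        if heading is not None or any(line.strip() for line in body):
            chunks.append((heading or "", "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


sample = "intro\n\n# A\ntext a\n\n## B\ntext b\n"
for title, text in split_by_headings(sample):
    print(repr(title), "->", repr(text))
```

Each `(heading, body)` pair can then be handed to whatever embedding or indexing step your framework provides. For real projects, the dedicated integrations described in the official documentation are the better starting point.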
And installation is child’s play:
pip install docling
And for use, it’s just as simple:
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
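If you would rather keep the result on disk than print it, the sketch below writes both a Markdown and a JSON version of the converted document. It assumes that `result.document` offers `export_to_dict()` alongside the `export_to_markdown()` call shown above; `export_document` and `convert_and_export` are hypothetical helper names invented for this example, not part of Docling.

```python
import json
from pathlib import Path


def export_document(document, out_dir, stem="document"):
    """Write Markdown and JSON versions of a converted document.

    `document` only needs export_to_markdown() and export_to_dict(),
    which the `result.document` object is assumed to provide.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    md_path = out / f"{stem}.md"
    md_path.write_text(document.export_to_markdown(), encoding="utf-8")

    json_path = out / f"{stem}.json"
    json_path.write_text(
        json.dumps(document.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return md_path, json_path


def convert_and_export(source, out_dir):
    """Convert `source` with Docling, then save the result to `out_dir`."""
    # Imported here so export_document() stays usable without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    return export_document(result.document, out_dir)
```

Calling `convert_and_export("https://arxiv.org/pdf/2408.09869", "out")` would then leave `out/document.md` and `out/document.json` next to each other, provided Docling is installed and the URL is reachable.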
On top of that, Docling does not just blindly convert your documents; it also comes with some really practical features, such as:
- Text recognition (OCR) for scanned PDFs
- Extraction of mathematical equations
- Source code detection
- A command line interface for quick uses
- Multi-platform support (Windows, macOS, Linux, on both x86_64 and arm64)
Developed by IBM, Docling is open source under the MIT license, and regular updates keep bringing new features.
Do not hesitate to try it for yourself, because this tool could well become essential in your dev toolbox.