Jina AI Reader tool can read PDF files from any URL and quickly parse them into text

Brain Titan
4 min read · Jun 9, 2024

Jina AI announced that its Reader tool can now read PDF files from any URL and quickly parse them into text for use by downstream large language models (LLMs).

Just prepend https://r.jina.ai/ to the URL of the PDF, as in this example (https://r.jina.ai/https://www.nasa.gov/wp-content/uploads/2023/01/55583main_vision_space_exploration2.pdf), to get the parsed text for use by a downstream LLM. Reader natively supports PDF reading, is compatible with most PDF files, including those with a large number of images, and parses them very quickly.
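The prefixing step can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names `reader_url` and `fetch_parsed_text` are hypothetical, the `User-Agent` value is arbitrary, and the sketch assumes the response body is UTF-8 text.

```python
from urllib.request import Request, urlopen

# Prefix described in the article; everything else here is illustrative.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(pdf_url: str) -> str:
    """Build the Reader URL by prepending the r.jina.ai prefix to a PDF URL."""
    return READER_PREFIX + pdf_url.strip()


def fetch_parsed_text(pdf_url: str, timeout: float = 60.0) -> str:
    """Request the parsed text through Reader (performs a network call)."""
    request = Request(
        reader_url(pdf_url),
        headers={"User-Agent": "reader-example"},  # arbitrary value, an assumption
    )
    with urlopen(request, timeout=timeout) as response:
        # Assumes the service answers with UTF-8 encoded text.
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_parsed_text(...)` with the NASA PDF URL from the example above should return the parsed text as a string, which can then be passed on to a language model.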

Previously, the tool’s PDF support was limited to arXiv and relied on arXiv’s HTML version. Parsing PDFs is complex: the URL has to be rendered to confirm whether it actually returns a PDF, and converting the result to legible text usually requires OCR. Jina Reader now offers this new feature for free, making more documents available to LLMs as text.

  • Jina AI Reader now supports reading any PDF from any URL.
  • Simply add the URL of the PDF to get the parsed text for downstream LLM use.
  • Reader natively supports PDF reading, including PDFs with a lot of images, and it is extremely fast.
  • Previously PDF support was limited to arXiv and relied on the HTML version provided by arXiv.
  • Parsing PDFs correctly is not easy: the URL needs to be rendered to determine whether it is a PDF.
  • PDF is designed for printing and is not suited to direct machine processing, so conversion to clean text usually requires OCR.
  • This new feature is now available in Jina Reader for free.

1. Difficulty of judging PDF by URL :

  • It is unreliable to judge whether a URL is a PDF simply by whether it ends with “.pdf”.
  • Some URLs look like PDF links but are not, and some that do not look like PDF links are, such as the link to arXiv ( example link ), which does not end in “.pdf” but returns a PDF.
  • Therefore, you need to render the URL first and handle it accordingly. Since the browser used for rendering may not render PDF content natively, you need to use a tool like pdf.js to render the page.
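To illustrate why the file extension is not enough, here is a minimal sketch that inspects the response itself rather than the URL. It is a simpler technique than rendering the page with pdf.js and is not presented as what Jina Reader does: it checks for the `%PDF-` signature that PDF files begin with and falls back to the `Content-Type` header. The function name `looks_like_pdf` is hypothetical.

```python
def looks_like_pdf(first_bytes: bytes, content_type: str = "") -> bool:
    """Guess whether a response is a PDF from its content, not its URL.

    first_bytes:  the first bytes of the response body
    content_type: the value of the Content-Type header, if any
    """
    # PDF files begin with the "%PDF-" signature; tolerate leading whitespace.
    if first_bytes.lstrip().startswith(b"%PDF-"):
        return True
    # Fall back to the declared media type, ignoring parameters such as charset.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"
```

With this check, a URL that does not end in “.pdf” is still recognized when the body starts with the PDF signature, and a URL that ends in “.pdf” but serves an HTML page is rejected.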

2. Complexity of PDF :

  • Many people forget that PDF was designed for printing, not for downstream processing.
  • Images, text, and tables in a PDF each sit in their own layer with no connection to one another; they simply appear at specific locations to produce the final layout.
  • An analogy is a set of HTML <div> elements, each placed with absolute top, left, right, and bottom positions.
  • Converting them into clean, LLM-friendly text usually requires using OCR to recognize the images, similar to converting a scanned paper book into electronic text.
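The absolute-positioning analogy can be made concrete with a toy model. The sketch below assumes text fragments that carry only a `top` and `left` coordinate, and rebuilds reading order by grouping fragments into lines and sorting each line from left to right. The `Fragment` class, the `reading_order` function, and the `line_tolerance` parameter are all hypothetical; real PDF extraction is far more involved, and this is not a description of Jina Reader's implementation.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    top: float   # distance from the top of the page
    left: float  # distance from the left edge of the page


def reading_order(fragments: list[Fragment], line_tolerance: float = 2.0) -> str:
    """Rebuild plain text from independently positioned fragments."""
    lines: list[list[Fragment]] = []
    # Walk the fragments from the top of the page to the bottom.
    for frag in sorted(fragments, key=lambda f: f.top):
        # Fragments whose tops are close to the line's first fragment share a line.
        if lines and abs(lines[-1][0].top - frag.top) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order the fragments from left to right.
    return "\n".join(
        " ".join(f.text for f in sorted(line, key=lambda f: f.left))
        for line in lines
    )
```

Nothing in the input says that two fragments belong to the same sentence; that relationship has to be inferred from coordinates alone, which is why clean text extraction from PDFs is hard.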

Detailed steps for Jina AI Reader to read any PDF

Prepare the PDF URL :

  • Find the URL of the PDF you want to parse.

Add the URL to Jina Reader :

  • Prepend https://r.jina.ai/ to the PDF URL, as in the example near the top of this article.

Parsing PDF :

  • Jina Reader will automatically parse the URL you provide and extract the content from it. This includes processing images, text, tables, etc.
  • Since it is impossible to tell from the URL alone whether it points to a PDF, Jina Reader uses pdf.js to render the page so that it can parse the content accurately.

View the analysis results :

  • Once parsing is complete, you can view the extracted text content, which has been processed and is suitable for downstream language model (LLM) use.

Handling the special case of embedded PDFs :

  • If a web page embeds one or more PDFs in its HTML, Jina Reader can also process and parse that content correctly.

Dealing with complex PDF formats :

  • For PDFs containing a large number of images or complex layouts, Jina Reader uses OCR technology to recognize text in images to ensure the integrity and accuracy of the content.

Use the parsed text :

  • The parsed text can be used in your language model, data analysis, or other downstream applications. It is optimized for further processing and use.

Jina AI Reader: https://jina.ai/reader/
