Paperless: Support for Office files with Gotenberg & Tika

Hello!

Today I’m happy to show you a simple way to let Paperless-ngx accept files in Office formats such as .doc, .xlsx or .odt. With a small change to your stack, you can upload and manage documents in popular office formats alongside your PDFs.

Introduction

What is Paperless-ngx?

Paperless-ngx is a modern open source solution for paperless document management. It is a fork of the Paperless project, which is designed to scan, tag, search and manage digital copies of paper documents to minimize the need to store physical copies. Paperless-ngx offers a variety of improvements over the original design, including better user interface support, more advanced search options, automatic tagging of documents based on their content, and OCR (optical character recognition) support in multiple languages, allowing for easier management and retrieval of documents in the database.

This project is particularly useful for individuals and companies seeking to reduce the amount of paper in their work and daily lives. It offers a simple and efficient way to organize digital documents.

Genesis of the issue

By default, Paperless-ngx handles PDF files, images and plain text. To extend this to Office documents, it relies on two companion services. Apache Tika reads the contents of files in formats such as .doc, .xlsx or .odt, extracting their text and metadata. Gotenberg is a tool that converts documents in various formats to PDF using HTTP requests, and Paperless uses it to produce the PDF version of each Office file. Despite the conversion, you can still retrieve documents from the archive in their original formats, so you keep full flexibility in managing your documents.
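To make the roles of the two services more concrete, here is a minimal Python sketch of the kind of HTTP requests involved. It is not how Paperless-ngx itself is implemented; it only builds a request for Tika's `PUT /tika` endpoint and for Gotenberg's `POST /forms/libreoffice/convert` route. The base URLs and the function names are assumptions made for illustration. The stack in this guide does not publish the container ports, so in practice you would run this from a container on the same Docker network and use the hostnames tika and gotenberg.

```python
import urllib.request
import uuid

# Assumed base URLs for illustration; inside the stack the hosts would be
# "tika" and "gotenberg" instead of "localhost".
TIKA_URL = "http://localhost:9998"
GOTENBERG_URL = "http://localhost:3000"


def tika_text_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT /tika request asking Tika for the plain text of a document."""
    return urllib.request.Request(
        f"{base_url}/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def gotenberg_convert_request(
    filename: str, data: bytes, base_url: str = GOTENBERG_URL
) -> urllib.request.Request:
    """Build a multipart POST to Gotenberg's LibreOffice route, which answers with a PDF."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        f"{base_url}/forms/libreoffice/convert",
        data=head + data + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def send(request: urllib.request.Request, timeout: float = 600) -> bytes:
    """Send a prepared request and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With both services reachable, passing the bytes of a .docx file to `send(tika_text_request(...))` should return the extracted text, and `send(gotenberg_convert_request("report.docx", ...))` should return the bytes of the converted PDF. The file name report.docx is only a placeholder.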

Step 1 – Installing Paperless-ngx on a Synology server

If you have not yet installed Paperless-ngx on your Synology device, I strongly encourage you to read this article. You will learn from it how to set up a paperless document management system. This is a great opportunity to start taking advantage of Paperless-ngx’s advanced capabilities for easy organization and secure document storage.

Step 2 – Edit parameters in Docker Compose

In step 2, we’ll look at editing the parameters in the Docker Compose file to enable advanced options in Paperless-ngx. This is a key part of the setup, allowing the process to be tailored to specific needs, including processing documents in Office formats.

  • Log in to the Portainer administrative interface.
    • If you are using the Authentik single sign-on system, you can make the login process easier by integrating Portainer with Authentik. For integration instructions, see the dedicated guide.
  • Under Environments, select the environment that holds your Paperless Docker Compose configuration. Then go to Stacks.

  • Select the stack named paperless-ngx (or whatever name you gave your Paperless stack).

Gotenberg and Tika

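  • Add the following Docker Compose code under the services: key of your existing stack to configure Gotenberg and Tika. The paperless_network entry must match the network your Paperless containers already use: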
  gotenberg:
    image: gotenberg/gotenberg:7.9.2
    restart: always
    container_name: paperless_gotenberg
    networks:
      - paperless_network
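    command:
      - "gotenberg"
      - "--log-level=debug"                      # verbose logs, handy while testing
      - "--chromium-disable-javascript=true"     # no JavaScript in Chromium
      - "--chromium-disable-routes=true"         # turn off the Chromium routes
      - "--chromium-allow-list=file:///tmp/.*"   # Chromium may only load files under /tmp
      - "--api-timeout=600s"                     # allow up to 600 s per API request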

  tika:
    image: ghcr.io/paperless-ngx/tika
    container_name: paperless_tika
    networks:
      - paperless_network
    restart: always

For Gotenberg, I pin the Docker image to version 7.9.2 because I have seen many bug reports about PDF conversion in the GitHub repository, and this version has proven the most stable with Paperless. In the command block, we configure Gotenberg by setting the log level, disabling JavaScript in Chromium, disabling the Chromium routes, restricting the URLs Chromium may load to files under /tmp, and setting the API timeout to 600 seconds. These settings are aimed at keeping Gotenberg stable in your environment.

Paperless-ngx

  • Add the following variables to the environment section of the Paperless-ngx webserver service:
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
  • Click Deploy the stack, then wait until Portainer processes the new configuration and recreates the containers.
  • Done! 🚀
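If you would like to confirm that both services answer before uploading anything, a small check along these lines may help. It is a sketch under a few assumptions: the function names are hypothetical, the defaults simply mirror the endpoint values set in the stack, and the script has to run somewhere that can resolve the tika and gotenberg hostnames, for example inside a container attached to paperless_network. It probes Tika's /version endpoint and Gotenberg's /health route.

```python
import os
import urllib.error
import urllib.request

# Defaults mirror the endpoint values used in the stack; adjust if yours differ.
DEFAULTS = {
    "PAPERLESS_TIKA_ENDPOINT": "http://tika:9998",
    "PAPERLESS_TIKA_GOTENBERG_ENDPOINT": "http://gotenberg:3000",
}


def probe_urls(env=None) -> dict:
    """Map each service name to the URL to probe, based on the Paperless settings."""
    env = os.environ if env is None else env
    tika = env.get(
        "PAPERLESS_TIKA_ENDPOINT", DEFAULTS["PAPERLESS_TIKA_ENDPOINT"]
    ).rstrip("/")
    gotenberg = env.get(
        "PAPERLESS_TIKA_GOTENBERG_ENDPOINT",
        DEFAULTS["PAPERLESS_TIKA_GOTENBERG_ENDPOINT"],
    ).rstrip("/")
    return {
        "tika": f"{tika}/version",           # Tika server reports its version here
        "gotenberg": f"{gotenberg}/health",  # Gotenberg health route
    }


def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def check_services(env=None) -> dict:
    """Probe both services and return a name -> reachable mapping."""
    return {name: is_up(url) for name, url in probe_urls(env).items()}
```

Calling `check_services()` returns a dictionary with one True or False entry per service. A False for either one usually points to a container that is not running or is attached to a different network.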

Step 3 – Upload the file to Paperless

To test the service, select the file you want to upload to Paperless.

  • Go to the Paperless panel and use the file upload option.
  • Once the file has been uploaded, check the Paperless logs to confirm that the document is sent to Tika and converted to PDF.
  • The logs should show the document's contents being read and the PDF being created. Example below:
[2024-05-07 18:42:03,800] [DEBUG] [paperless.consumer] Detected mime type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
[2024-05-07 18:42:04,142] [DEBUG] [paperless.consumer] Parser: TikaDocumentParser
[2024-05-07 18:42:04,147] [DEBUG] [paperless.consumer] Parsing 2024-04-06 XXXX.docx...
[2024-05-07 18:42:04,148] [INFO] [paperless.parsing.tika] Sending /tmp/paperless/paperless-ngx97r_6y6x/2024-04-06 XXXX.docx to Tika server
[2024-05-07 18:42:06,377] [INFO] [paperless.parsing.tika] Converting /tmp/paperless/paperless-ngx97r_6y6x/2024-04-06 XXXX.docx to PDF as /tmp/paperless/paperless-e0xc5cx4/convert.pdf
[2024-05-07 18:42:11,666] [DEBUG] [paperless.consumer] Generating thumbnail for 2024-04-06 XXXX.docx...
[2024-05-07 18:42:11,673] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient -define pdf:use-cropbox=true /tmp/paperless/paperless-e0xc5cx4/convert.pdf[0] /tmp/paperless/paperless-e0xc5cx4/convert.webp
[2024-05-07 18:42:13,766] [INFO] [paperless.parsing] convert exited 0
[2024-05-07 18:42:13,984] [DEBUG] [paperless.consumer] Saving record to database
[2024-05-07 18:42:13,985] [DEBUG] [paperless.consumer] Creation date from parse_date: 2024-04-06 18:29:00+00:00
[2024-05-07 18:42:14,120] [INFO] [paperless.handlers] Assigning document type Filip / Sprawy publiczne, to 2024-04-06 2024-04-06 XXXX.docx
[2024-05-07 18:42:14,177] [INFO] [paperless.handlers] Assigning storage path Sprawy publiczne: 2024 Warszawa, Zarząd Dróg Miejskich
[2024-05-07 18:42:14,592] [DEBUG] [paperless.consumer] Deleting file /tmp/paperless/paperless-ngx97r_6y6x/2024-04-06 XXXX.docx
[2024-05-07 18:42:14,605] [DEBUG] [paperless.parsing.tika] Deleting directory /tmp/paperless/paperless-e0xc5cx4
[2024-05-07 18:42:14,607] [INFO] [paperless.consumer] Document 2024-04-06 XXXX consumption finished
[2024-05-07 18:42:14,627] [INFO] [paperless.tasks] ConsumeTaskPlugin completed with: Success. New document id 4151 created
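If you prefer not to read the log by eye, the key stages can be checked with a few regular expressions. The sketch below is only an illustration: the stage names and function names are made up for this example, and the patterns are taken from the messages in the excerpt above, so they may need adjusting if your Paperless-ngx version words its log lines differently.

```python
import re

# Stage name -> pattern, based on the messages in the log excerpt above.
# The stage names are hypothetical labels chosen for this example.
STAGES = {
    "parser_selected": re.compile(r"Parser: TikaDocumentParser"),
    "sent_to_tika": re.compile(r"Sending .+ to Tika server"),
    "converted_to_pdf": re.compile(r"Converting .+ to PDF as .+\.pdf"),
    "finished": re.compile(r"consumption finished"),
}


def tika_stages(log_text: str) -> dict:
    """Report which Tika/Gotenberg stages appear anywhere in the given log text."""
    return {name: bool(pattern.search(log_text)) for name, pattern in STAGES.items()}


def missing_stages(log_text: str) -> list:
    """List the stages that did not show up, in pipeline order."""
    return [name for name, seen in tika_stages(log_text).items() if not seen]
```

Feeding the log text of a successful upload to `missing_stages` should give an empty list. If sent_to_tika is missing, the Tika settings are probably not being picked up; if only converted_to_pdf is missing, Gotenberg is the more likely culprit.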

If you have additional questions about the setup, go ahead and leave a comment under this article or contact me directly. I will be happy to answer any concerns and help solve any problems. Your questions can help improve this guide for other users.

Additional sources and information

Configure Single Sign-On (SSO) between Authentik and Paperless-ngx using OpenID Connect to increase the convenience of logging in and security of access to your document management system. Learn the steps needed to integrate these two powerful tools and enjoy a smoother authentication process.

👉 Learn more about the process and make your login management easier.

Discover how to resolve digital signature error in PDF files.

👉 Go to the article that will show you how to deal with the message: “DigitalSignatureError”.

For further exploration and more information, I recommend checking out the links below. They are valuable sources that were used in the development of this guide.

Filip Chochół

Filip Chochół runs two blogs: his personal blog “chochol.io” and the travel blog “Warsaw Travelers”, which he writes together with his girlfriend. He specializes in IT resource management and technical support, and has been active in the field of cyber security awareness for almost two years. A proponent of open-source technologies, he previously worked in the film and television industry in the camera division (2013-2021). After hours, he develops interests in smart homes and networking.
