Help:Converting PDF to DjVu

From Wikimedia Commons, the free media repository

Converting PDF files to DjVu is not always necessary, but it can be useful because of the advantages of the DjVu format and the problems that some PDF files have (for example, incompatible layering of text and images, embedded JPEG images, fonts or other issues). Each PDF should be evaluated on its own merits before a decision to convert is made. Vector PDF files (for example, those created from a digital original, such as a document written in OpenOffice or Microsoft Word) should not normally be converted; these are common among recent government documents. Having DjVu versions of PDF books is beneficial because:

  • DjVu files are smaller, which was the main idea behind the DjVu format;
  • DjVu documents are rendered and scrolled more easily: comparing PDF and DjVu viewers shows that DjVu has a better response time;
  • DjVu format doesn't require fonts;
  • DjVu can have an indexed, searchable outline in WinDjView. See Help:Creating an outline for PDF and DjVu.

However, such a conversion tends to reduce quality, so care should be taken to keep the quality acceptable.

Improving the Windows command line

The conversion may require command line tools. The standard Windows command shell is enough for the task, but it can be improved to make it easier to use. See Help:Improving the Windows command shell.

Using ready software solutions

  • pdf2djvu can easily convert a PDF file to a DjVu file. You can use this command line to do it:
    pdf2djvu -o p.djvu --dpi=900 p.pdf
  • Other software for converting PDF to DjVu exists, like Celartem pdftodjvu. However, it has been found to crash on certain files under some conditions.

Usage: pdftodjvu book.pdf [-o book.djvu] [-mode:Document|Bitonal] [other parameters...]

Parameters:

-o book.djvu
Output DjVu file. If -o is not specified, pdftodjvu uses the name of the input PDF file with the '.djvu' extension.
-mode SegmentationMode
Sets the so-called segmentation mode, which can be one of: Document, Bitonal, PhotoIfFGEmpty, SegmentAlways, PhotoAlways.

Acceptable results are typically only obtained with Document or Bitonal values for mode.

-dpi DPI
Sets the DPI.

The parameters are explained in detail upon a simple launch of this program.

  • Some software cannot directly convert Google Books files because of their compression. In that case, the conversion can be done by extracting the pages in a bitmap format and assembling them into DjVu (see below). A programming challenge remains here: a converter that could handle even Google Books PDF files directly would be desirable.
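
The pdf2djvu call shown in the first item of this list can be wrapped in a small script when several files need converting. The following is a minimal Python sketch, not a definitive tool: the function names pdf2djvu_command and convert_pdf are made up for illustration, and only the -o and --dpi options already used above are passed to pdf2djvu.

```python
import shutil
import subprocess
from pathlib import Path


def pdf2djvu_command(pdf_path, dpi):
    """Build the pdf2djvu argument list for one PDF file.

    The output name is the input name with a .djvu extension
    (an assumption of this sketch, not a pdf2djvu requirement).
    """
    pdf = Path(pdf_path)
    out = pdf.with_suffix(".djvu")
    return ["pdf2djvu", "-o", str(out), f"--dpi={dpi}", str(pdf)]


def convert_pdf(pdf_path, dpi):
    """Run pdf2djvu if it is installed; return the output path or None."""
    cmd = pdf2djvu_command(pdf_path, dpi)
    if shutil.which(cmd[0]) is None:
        # Tool missing: only report what would have been run.
        print("pdf2djvu not found on PATH; would run:", " ".join(cmd))
        return None
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(cmd[2])
```

Calling convert_pdf("p.pdf", 900) corresponds to the command line given above; looping over Path(".").glob("*.pdf") would convert a whole folder.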


Splitting into images and assembling them into DjVu

If satisfactory results cannot be achieved with the existing converters, the only working method may be to extract all pages of the original file as images and create a DjVu file out of them. Different approaches are required for coloured and bitonal PDF files.

Step 1: Extracting the pages as images

The first step is to extract PDF pages as images. The format of the images is important, because the programs that will convert them to DjVu require JPEG or PNM for coloured documents, and TIFF or PBM for bitonal documents.

  • The GUI program PDF-XChange Viewer can extract into the required JPG, TIFF or PBM formats, and STDU Viewer supports JPG extraction.
  • The advantage of PDF-XChange Viewer is that it can also smooth the whole pages graphically and extract the smoothed images, which may be useful because in some scanned books the text looks rough. (The smoothing is enabled in Preferences → Rendering → Smooth images.)
  • Another advantage is that it can extract the whole book into one multi-page bitonal TIFF file, which can be later converted with minidjvu (step 3 below).
  • If the desired format is TIFF, it needs to be set to bitonal: you need to open the export window (File → Export → Export to image...), then select Options → Image type → 1 (Black & white).
  • To get a multi-page TIFF, Export mode in the "Export to image" window needs to be set accordingly.
  • The command line tool pdfimages can produce PBM/PPM files and, if the images are stored internally in the JPEG format, it can extract them as JPEG files.
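
For the pdfimages route, the extraction can be scripted. The sketch below is only an illustration under stated assumptions: the function names and the "page" file prefix are hypothetical, and the only pdfimages option used is -j, which writes images stored as JPEG inside the PDF as .jpg files instead of converting them to PBM/PPM.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, out_dir, keep_jpeg=True):
    """Build a pdfimages argument list.

    Without -j the images are written as PBM/PPM files; with -j, images
    stored as JPEG in the PDF are written as JPEG files.
    """
    root = Path(out_dir) / "page"  # hypothetical prefix for the output files
    cmd = ["pdfimages"]
    if keep_jpeg:
        cmd.append("-j")
    cmd += [str(pdf_path), str(root)]
    return cmd


def extract_images(pdf_path, out_dir, keep_jpeg=True):
    """Run pdfimages if available and return the sorted list of files written."""
    cmd = pdfimages_command(pdf_path, out_dir, keep_jpeg)
    if shutil.which(cmd[0]) is None:
        print("pdfimages not found on PATH; would run:", " ".join(cmd))
        return []
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)
    return sorted(Path(out_dir).glob("page-*"))
```

Note that pdfimages extracts the embedded images rather than rendering whole pages, so a page composed of several image layers may yield several files.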

Step 2: Converting (editing) the images

The images are extracted with the aim of creating a DjVu file out of them. They will likely need to be converted to a format accepted by the program that creates DjVu out of images: JPEG or PNM for coloured documents, and TIFF or PBM for bitonal documents. See Help:Converting images for details.

  • If ImageMagick convert is used for conversion, images can also optionally be edited along with the conversion.
  • Smoothing the text. The ImageMagick convert -blur option with a value of about 0.05 to 0.3 can be used to smooth the text. See Help:Smoothing text in PDF or DjVu scanned books.
  • Reducing the dimensions of the images to obtain a smaller DjVu file. This may be done with the ImageMagick convert -resize option.
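
The conversion and the two optional edits can be combined in one ImageMagick call per image. The following Python sketch is an illustration only: the function name convert_command is made up, and it assumes that the blur value mentioned above is meant as the sigma part of -blur's radius x sigma argument, with the radius left at 0 so that ImageMagick picks it automatically.

```python
def convert_command(src, dst, blur_sigma=None, resize_percent=None):
    """Build an ImageMagick 'convert' argument list.

    The output format is chosen by ImageMagick from the extension of dst.
    blur_sigma and resize_percent are optional; pass None to skip a step.
    """
    cmd = ["convert", str(src)]
    if blur_sigma is not None:
        # Assumption: the value is a sigma; radius 0 lets ImageMagick choose.
        cmd += ["-blur", f"0x{blur_sigma}"]
    if resize_percent is not None:
        cmd += ["-resize", f"{resize_percent}%"]
    cmd.append(str(dst))
    return cmd


if __name__ == "__main__":
    # Hypothetical file names; prints the command instead of running it.
    print(" ".join(convert_command("page-000.ppm", "page-000.jpg",
                                   blur_sigma=0.3, resize_percent=50)))
```

The resulting list can be passed to subprocess.run. On Windows, the name convert may also resolve to an unrelated system utility, and ImageMagick 7 provides the magick command as the entry point, so the first element may need adjusting.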

Step 3: Assembling the images into DjVu
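
This page does not name the assembling programs apart from minidjvu, so the following Python sketch rests on an assumption: that the DjVuLibre tools are used, with c44 encoding coloured JPEG/PNM pages, cjb2 encoding bitonal PBM/TIFF pages, and djvm -c bundling the single-page files into one document. The function names and the extension-based choice of encoder are hypothetical.

```python
from pathlib import Path

# Extension-based guess, following the formats named in steps 1 and 2.
COLOUR_EXTS = {".jpg", ".jpeg", ".ppm", ".pgm", ".pnm"}
BITONAL_EXTS = {".pbm", ".tif", ".tiff"}


def encode_command(image_path):
    """Build the command that turns one page image into a one-page DjVu file."""
    img = Path(image_path)
    out = img.with_suffix(".djvu")
    ext = img.suffix.lower()
    if ext in COLOUR_EXTS:
        return ["c44", str(img), str(out)]   # coloured page
    if ext in BITONAL_EXTS:
        return ["cjb2", str(img), str(out)]  # bitonal page
    raise ValueError(f"unsupported image format: {img.name}")


def bundle_command(page_files, book_path):
    """Build the djvm command that creates a multi-page DjVu from page files."""
    return ["djvm", "-c", str(book_path)] + [str(p) for p in page_files]
```

Each returned list can be run with subprocess.run(cmd, check=True), encoding the pages in order first and bundling them last. A multi-page bitonal TIFF produced in step 1 would instead go to minidjvu, which this sketch does not cover.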

Using online services

  • Any2DjVu Server can convert a PDF file directly.
  • PDF files from the rich French online library Gallica can be fetched simply by giving their FTP address to Any2DjVu Server. However, it is often advisable to crop the PDF file to have a better placement of the page in the frame. This may prove to be a challenge. Someone familiar with the subject is welcome to edit this article and describe this process.
  • Other online services exist that claim to convert PDF to DjVu, but they may have limitations and may not always deliver the expected results. You can easily find them by searching for convert pdf to djvu online (or the same in your native language).

Transferring the outline

Use HandyOutliner or Pdf & DjVu Bookmarker to transfer the outline (table of contents) to the converted file. See Help:Creating an outline for PDF and DjVu.

Leaving requests

If you would like a document on Commons to be converted but cannot do it yourself, you may leave a request on the "Commons requests" category page.

See also