Examining a page

Pages are dictionaries

In PDFs, the main data structure is the dictionary, a key-value data structure much like a Python dict or attrdict. The major difference is that the keys can only be names, while values can be any type, including other dictionaries.

PDF dictionaries are represented as pikepdf.Dictionary, and names are of type pikepdf.Name. A page is just another dictionary, with a few required fields that give it special status as a page.

A pikepdf.Name is usually an ASCII-encoded string beginning with “/” followed by a capital letter.

In [1]: from pikepdf import Pdf

In [2]: example = Pdf.open('../tests/resources/congress.pdf')

In [3]: page1 = example.pages[0]

In [4]: page1
Out[4]: 
<pikepdf.Dictionary(type_="/Page")({
  "/Contents": pikepdf.Stream(stream_dict={
      "/Length": 50
    }, data=<...>),
  "/MediaBox": [ 0, 0, 200, 304 ],
  "/Parent": <reference to /Pages>,
  "/Resources": {
    "/XObject": {
      "/Im0": pikepdf.Stream(stream_dict={
          "/BitsPerComponent": 8,
          "/ColorSpace": "/DeviceRGB",
          "/Filter": [ "/DCTDecode" ],
          "/Height": 1520,
          "/Length": 192956,
          "/Subtype": "/Image",
          "/Type": "/XObject",
          "/Width": 1000
        }, data=<...>)
    }
  },
  "/Type": "/Page"
})>

Item and attribute notation

Dictionary values may be looked up using item notation (page1['/MediaBox']) or attribute notation (page1.MediaBox). The two conventions are equivalent.

In [5]: page1.MediaBox
Out[5]: pikepdf.Array([ 0, 0, 200, 304 ])

In [6]: page1['/MediaBox']
Out[6]: pikepdf.Array([ 0, 0, 200, 304 ])

By convention, pikepdf uses attribute notation for keys in the PDF specification and item notation for internal names within a PDF. For example:

In [7]: page1.Resources.XObject['/Im0']

Here '/Im0' is an arbitrary name generated by the program that produced this PDF, rather than a name in the specification like Resources and XObject. Using item notation throughout would be quite cumbersome: page1['/Resources']['/XObject']['/Im0'] (not recommended).

Attribute notation is convenient, but not robust if elements are missing. For elements that are not always present, you can use .get(), which behaves like dict.get() in core Python. A library such as glom might help when working with complex structured data that is not always present.
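As a minimal sketch of the .get() approach, the helper below walks a chain of keys and returns a default as soon as one is missing. The name get_path is hypothetical (it is not part of pikepdf), and plain Python dicts stand in for a page here; the sketch assumes only that each level offers a dict-style .get(), as described above.

```python
# Hypothetical helper, not part of pikepdf: follow a chain of keys with .get()
_MISSING = object()  # sentinel so that a stored None is not mistaken for "absent"


def get_path(obj, *keys, default=None):
    """Return obj[k1][k2]... or `default` if any key along the way is absent."""
    current = obj
    for key in keys:
        getter = getattr(current, "get", None)
        if getter is None:
            return default  # reached a value that cannot be looked up by key
        current = getter(key, _MISSING)
        if current is _MISSING:
            return default
    return current


# A plain dict standing in for a page dictionary (illustrative data only)
page = {"/Resources": {"/XObject": {"/Im0": "image stream"}}}

found = get_path(page, "/Resources", "/XObject", "/Im0")
missing = get_path(page, "/Resources", "/Font", "/F1", default="no font")
print(found)    # image stream
print(missing)  # no font
```

Returning a default instead of raising keeps the calling code short when optional entries such as fonts or images may be absent from a page.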

repr() output

Returning to the page’s output:

In [8]: page1
Out[8]: 
<pikepdf.Dictionary(type_="/Page")({
  "/Contents": pikepdf.Stream(stream_dict={
      "/Length": 50
    }, data=<...>),
  "/MediaBox": [ 0, 0, 200, 304 ],
  "/Parent": <reference to /Pages>,
  "/Resources": {
    "/XObject": {
      "/Im0": pikepdf.Stream(stream_dict={
          "/BitsPerComponent": 8,
          "/ColorSpace": "/DeviceRGB",
          "/Filter": [ "/DCTDecode" ],
          "/Height": 1520,
          "/Length": 192956,
          "/Subtype": "/Image",
          "/Type": "/XObject",
          "/Width": 1000
        }, data=<...>)
    }
  },
  "/Type": "/Page"
})>

The angle brackets in the output indicate that this object cannot be constructed with a Python expression because it contains a reference. When angle brackets are omitted from the repr() of a pikepdf object, then the object can be replicated with a Python expression, such as eval(repr(x)) == x.

In Jupyter and IPython, pikepdf will instead attempt to display a preview of the PDF page. An explicit repr(page) will show the text representation.

This page’s MediaBox is a direct object. The MediaBox describes the size of the page in PDF coordinates (1/72 inch multiplied by the value of the page’s /UserUnit, if present).

In [9]: import pikepdf

In [10]: page1.MediaBox
Out[10]: pikepdf.Array([ 0, 0, 200, 304 ])

In [11]: pikepdf.Array([ 0, 0, 200, 304 ])
Out[11]: pikepdf.Array([ 0, 0, 200, 304 ])
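To make the units concrete, here is a small sketch that converts a MediaBox into a physical page size. The function name page_size_inches and the user_unit parameter are illustrative, not pikepdf API; the sketch assumes the four MediaBox entries can be converted with float() and that /UserUnit defaults to 1 when absent.

```python
POINTS_PER_INCH = 72   # one default PDF unit is 1/72 inch
MM_PER_INCH = 25.4


def page_size_inches(mediabox, user_unit=1.0):
    """Convert a [x0, y0, x1, y1] MediaBox to (width, height) in inches."""
    x0, y0, x1, y1 = (float(v) for v in mediabox)
    width = abs(x1 - x0) * user_unit / POINTS_PER_INCH
    height = abs(y1 - y0) * user_unit / POINTS_PER_INCH
    return width, height


# The MediaBox of the example page shown above
width_in, height_in = page_size_inches([0, 0, 200, 304])
print(f"{width_in:.2f} x {height_in:.2f} in")  # 2.78 x 4.22 in
print(f"{width_in * MM_PER_INCH:.1f} x {height_in * MM_PER_INCH:.1f} mm")  # 70.6 x 107.2 mm
```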

The page’s /Contents key contains instructions for drawing the page content. Also attached to this page is a /Resources dictionary, which contains a single XObject image. The image is compressed with the /DCTDecode filter, meaning it is encoded with the DCT, so it is a JPEG. [1]

[1] Without the JFIF header.

Viewing images

pikepdf provides a helper class PdfImage for manipulating PDF images.

In [12]: from pikepdf import PdfImage

In [13]: pdfimage = PdfImage(page1.Resources.XObject['/Im0'])

In [14]: pdfimage
Out[14]: <pikepdf.PdfImage image mode=RGB size=1000x1520 at 0x3ffa0325320>

In Jupyter (or IPython with a suitable configuration) the image will be displayed.

You can also inspect the properties of the image. The parameters are similar to Pillow’s.

In [15]: pdfimage.colorspace
Out[15]: '/DeviceRGB'

In [16]: pdfimage.width, pdfimage.height
Out[16]: (1000, 1520)

Note

.width and .height are the resolution of the image in pixels, not the size of the image in page coordinates.
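The difference between pixels and page coordinates can be illustrated by estimating the image's effective resolution. This is only a sketch: effective_dpi is a hypothetical helper, and it assumes the image is drawn to fill the whole 200 x 304 MediaBox of this page. In general the drawn size is set by the transformation in the page's content stream, which this sketch does not read.

```python
def effective_dpi(pixels, points, user_unit=1.0):
    """Pixels per inch for an image spanning `points` PDF units along one axis."""
    inches = points * user_unit / 72  # one default PDF unit is 1/72 inch
    return pixels / inches


# 1000 x 1520 pixels, assumed to cover the full 200 x 304 unit page
dpi_x = effective_dpi(1000, 200)
dpi_y = effective_dpi(1520, 304)
print(round(dpi_x), round(dpi_y))  # 360 360
```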

Extracting images

Extracting images is straightforward. extract_to() will extract images to streams, such as an open file. Where possible, extract_to() writes compressed data directly to the stream without transcoding. The return value is the file extension of the extracted image.

In [17]: with open('file.jpg', 'wb') as f:
   ...:     pdfimage.extract_to(stream=f)

You can also retrieve the image as a Pillow image:

In [18]: pdfimage.as_pil_image()

Note

This simple example PDF displays a single full-page image. Some PDF creators will paint a page using multiple images, and features such as layers, transparency, and image masks. Accessing the first image on a page is like an HTML parser that scans for the first <img src=""> tag it finds. A lot more could be happening. There can be multiple images drawn multiple times on a page, vector art, overdrawing, masking, and transparency. A set of resources can be grouped together in a “Form XObject” (not to be confused with a PDF Form), and drawn all at once. Images can be referenced by multiple pages.

Replacing an image

See test_image_access.py::test_image_replace.