PyPDF2

PyPDF2 is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs and merge entire files together.

Homepage: http://mstamy2.github.io/PyPDF2/

The Destination Class

class PyPDF2.Destination(title, page, typ, *args)

A class representing a destination within a PDF file. See section 8.2.1 of the PDF 1.6 reference.

Destination.bottom

Read-only property accessing the bottom vertical coordinate.

Destination.left

Read-only property accessing the left horizontal coordinate.

Destination.page

Read-only property accessing the destination page.

Destination.right

Read-only property accessing the right horizontal coordinate.

Destination.title

Read-only property accessing the destination title.

Destination.top

Read-only property accessing the top vertical coordinate.

Destination.typ

Read-only property accessing the destination type.

Destination.zoom

Read-only property accessing the zoom factor.

The DocumentInformation Class

class PyPDF2.DocumentInformation

A class representing the basic document metadata provided in a PDF File.

DocumentInformation.author

Read-only property accessing the document’s author.

DocumentInformation.creator

Read-only property accessing the document’s creator. If the document was converted to PDF from another format, this is the name of the application (for example, OpenOffice) that created the original document from which it was converted.

DocumentInformation.producer

Read-only property accessing the document’s producer. If the document was converted to PDF from another format, this is the name of the application (for example, OSX Quartz) that converted it to PDF.

DocumentInformation.subject

Read-only property accessing the subject of the document.

DocumentInformation.title

Read-only property accessing the document’s title.
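As an illustration of the properties above, the following minimal sketch collects them into a plain dictionary. The helper name summarize_info is hypothetical and not part of PyPDF2. It assumes the object comes from PdfFileReader.getDocumentInfo(), and that this call yields None when the file has no document information dictionary.

```python
def summarize_info(info):
    """Collect the documented DocumentInformation properties into a dict.

    `info` is assumed to be the result of PdfFileReader.getDocumentInfo().
    It is treated as possibly None (assumption: no docinfo dictionary).
    """
    fields = ("author", "creator", "producer", "subject", "title")
    if info is None:
        # No document information available: report every field as missing.
        return {name: None for name in fields}
    # Read each documented read-only property, tolerating absent ones.
    return {name: getattr(info, name, None) for name in fields}
```

With a real file, the argument would typically be reader.getDocumentInfo() for a PdfFileReader named reader.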

The PageObject Class

class PyPDF2.PageObject(pdf)

This class represents a single page within a PDF file. Typically this object will be created by calling the getPage() function of the PdfFileReader class.

PageObject.artBox

A rectangle (RectangleObject), expressed in default user space units, defining the extent of the page’s meaningful content as intended by the page’s creator.

PageObject.bleedBox

A rectangle (RectangleObject), expressed in default user space units, defining the region to which the contents of the page should be clipped when output in a production environment.

PageObject.compressContentStreams()

Compresses the size of this page by joining all content streams and applying a FlateDecode filter.

PageObject.cropBox

A rectangle (RectangleObject), expressed in default user space units, defining the visible region of default user space. When the page is displayed or printed, its contents are to be clipped (cropped) to this rectangle and then imposed on the output medium in some implementation-defined manner. Default value: same as MediaBox.

PageObject.extractText()

Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

PageObject.mediaBox

A rectangle (RectangleObject), expressed in default user space units, defining the boundaries of the physical medium on which the page is intended to be displayed or printed.

PageObject.mergePage(page2)

Merges the content streams of two pages into one. Resource references (e.g. fonts) are maintained from both pages. The mediabox/cropbox/etc of this page are not altered. The content stream of page2 will be added to the end of this page’s content stream, meaning that it will be drawn after, or “on top” of, this page.
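A common use of mergePage() is to draw the same stamp or watermark page over several pages. The sketch below shows that pattern. The helper name stamp_pages is hypothetical, and the pages are assumed to be PageObject instances obtained from a PdfFileReader.

```python
def stamp_pages(pages, stamp_page):
    """Draw `stamp_page` on top of every page in `pages`.

    Each page is modified in place through mergePage(); the stamp's
    content is appended to the page's content stream, so it is drawn last.
    """
    stamped = []
    for page in pages:
        page.mergePage(stamp_page)  # stamp ends up "on top" of the page
        stamped.append(page)
    return stamped
```

Because the boxes of the receiving page are not altered, a stamp larger than the page may be clipped when displayed.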

PageObject.rotateClockwise(angle)

Rotates a page clockwise by the given angle, which must be a multiple of 90 degrees.

PageObject.rotateCounterClockwise(angle)

Rotates a page counter-clockwise by the given angle, which must be a multiple of 90 degrees.
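The two rotation functions can be wrapped in a single helper that takes a signed angle. This is only a sketch: rotate_page is a hypothetical name, and the sign convention (positive for clockwise) is an assumption of this example rather than something PyPDF2 defines.

```python
def rotate_page(page, angle):
    """Rotate `page` by a signed angle given in degrees.

    Positive angles rotate clockwise, negative angles counter-clockwise
    (a convention chosen for this sketch).
    """
    if angle % 90 != 0:
        # Both rotation functions only accept increments of 90 degrees.
        raise ValueError("angle must be a multiple of 90 degrees")
    if angle >= 0:
        page.rotateClockwise(angle)
    else:
        page.rotateCounterClockwise(-angle)
    return page
```

For example, rotate_page(page, -90) would turn a page a quarter turn counter-clockwise.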

PageObject.trimBox

A rectangle (RectangleObject), expressed in default user space units, defining the intended dimensions of the finished page after trimming.

The PdfFileReader Class

class PyPDF2.PdfFileReader(stream)

Initializes a PdfFileReader object. This operation can take some time, as the PDF stream’s cross-reference tables are read into memory.

PdfFileReader.decrypt(password)

When using an encrypted / secured PDF file with the PDF Standard encryption handler, this function will allow the file to be decrypted. It checks the given password against the document’s user password and owner password, and then stores the resulting decryption key if either password is correct.

It does not matter which password was matched. Both passwords provide the correct decryption key that will allow the document to be used with this library.
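A typical pattern is to check whether the file is encrypted and only then try a password. The sketch below assumes that decrypt() returns 0 when the password matches neither the user nor the owner password, and a non-zero value otherwise. The helper name unlock is hypothetical.

```python
def unlock(reader, password):
    """Return True when the pages of `reader` can be read.

    `reader` is assumed to be a PdfFileReader. Unencrypted files need
    no password, so they are reported as readable immediately.
    """
    if not reader.isEncrypted:
        return True
    # Assumption: decrypt() returns 0 on failure and a non-zero value
    # when either the user or the owner password matched.
    return reader.decrypt(password) != 0
```

Since isEncrypted stays true after a successful decrypt() call, the return value of this helper, not the property, indicates whether the document is usable.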

PdfFileReader.documentInfo

Read-only property that accesses the PdfFileReader.getDocumentInfo() function.

PdfFileReader.getDocumentInfo()

Retrieves the PDF file’s document information dictionary, if it exists. Note that some PDF files use metadata streams instead of docinfo dictionaries, and these metadata streams will not be accessed by this function.

PdfFileReader.getNamedDestinations(tree=None, retval=None)

Retrieves the named destinations present in the document.

PdfFileReader.getNumPages()

Calculates the number of pages in this PDF file.

PdfFileReader.getOutlines(node=None, outlines=None)

Retrieves the document outline present in the document.

PdfFileReader.getPage(pageNumber)

Retrieves a page by number from this PDF file.

PdfFileReader.isEncrypted

Read-only boolean property showing whether this PDF file is encrypted. Note that this property, if true, will remain true even after the PdfFileReader.decrypt() function is called.

PdfFileReader.namedDestinations

Read-only property that accesses the PdfFileReader.getNamedDestinations() function.

PdfFileReader.numPages

Read-only property that accesses the PdfFileReader.getNumPages() function.

PdfFileReader.outlines

Read-only property that accesses the PdfFileReader.getOutlines() function.

PdfFileReader.pages

Read-only property that emulates a list based upon the PdfFileReader.getNumPages() and PdfFileReader.getPage() functions.
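The page-access functions above combine naturally into a loop over the whole document. The following sketch extracts the text of every page. The name extract_all_text is hypothetical, and page numbers are assumed to start at 0.

```python
def extract_all_text(reader):
    """Return a list holding the extracted text of each page, in page order.

    `reader` is assumed to be a PdfFileReader (decrypted if necessary).
    """
    texts = []
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)  # assumption: zero-based page numbers
        texts.append(page.extractText())
    return texts
```

With a real file, the reader might be created as PdfFileReader(open("input.pdf", "rb")), where input.pdf is a placeholder name. The quality of the result depends on how the PDF was generated, as noted for PageObject.extractText().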

The PdfFileWriter Class

class PyPDF2.PdfFileWriter

This class supports writing PDF files out, given pages produced by another class (typically PdfFileReader).

PdfFileWriter.addPage(page)

Adds a page to this PDF file. The page is usually acquired from a PdfFileReader instance.

PdfFileWriter.encrypt(user_pwd, owner_pwd=None, use_128bit=True)

Encrypt this PDF file with the PDF Standard encryption handler.

PdfFileWriter.write(stream)

Writes the collection of pages added to this object out as a PDF file.
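Putting the reader and writer together, the sketch below copies every page from a reader into a writer, optionally encrypts the result, and writes it out. The helper name copy_pages and its parameters are hypothetical. The output stream is assumed to be a file-like object opened for binary writing.

```python
def copy_pages(reader, writer, stream, user_pwd=None):
    """Copy all pages from `reader` to `writer`, then write to `stream`.

    `reader` and `writer` are assumed to be a PdfFileReader and a
    PdfFileWriter. When `user_pwd` is given, the output is encrypted
    with the PDF Standard encryption handler before writing.
    """
    count = reader.getNumPages()
    for index in range(count):
        writer.addPage(reader.getPage(index))
    if user_pwd is not None:
        writer.encrypt(user_pwd)
    writer.write(stream)
    return count  # number of pages copied
```

A possible call would be copy_pages(PdfFileReader(open("input.pdf", "rb")), PdfFileWriter(), open("output.pdf", "wb")), with both file names being placeholders.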