PDF (Portable Document Format) is a file format we come across all the time: papers, manuals, and many other documents are distributed as PDF. Its main advantage is a stable layout, so a document keeps its original colors and formatting when it is printed, shared, or transferred.
PDF is a document format built on the PostScript imaging model, which is what makes its layout so stable. That stability comes at a price, though: PDF files are hard to edit.
Splitting, merging, cropping, converting, and editing a document are all areas where PDF falls short.
Tools such as Adobe Reader, Foxit Reader, and Panda PDF are mostly used for reading documents, and their free versions cannot edit them. Web-based tools such as SmallPDF and iLovePDF can edit PDFs, but they limit the size of the document.
I once needed to replace a single page in a PDF. I tried almost every mainstream PDF tool on the market and in the end had to pay for one to get the job done.
Thinking it over: if commercial software is this unreliable, why not write a tool yourself? When a few dozen lines of code solve the problem, there is little reason to download and install software that does not deliver.
This article shows how to use Python to build a simple PDF editing tool that can convert PDF to TXT and split, merge, crop, and convert pages.
PyPDF2 is a third-party Python PDF library that can split, merge, crop, and transform the pages of PDF files.
It can also add custom data, watermarks, and passwords to PDF files, and retrieve text and metadata from them.
Installation
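Install it with pip. The examples below use the camelCase API (PdfFileReader, getPage, and so on), which PyPDF2 3.0 removed, so pin an older release to run them unchanged.
$ pip install "PyPDF2<3"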
The following sections demonstrate a few PDF editing features and explain the key lines of the code.
Delete PDF pages
First, here is the code:
from PyPDF2 import PdfFileWriter, PdfFileReader
output = PdfFileWriter()  # 1
input1 = PdfFileReader(open("example.pdf", "rb"))  # 2
def delete_pdf(index):
    pages = input1.getNumPages()  # 3
    for i in range(pages):
        # index holds the 1-based numbers of the pages to drop
        if i + 1 in index:
            continue
        output.addPage(input1.getPage(i))  # 4
    with open("PyPDF2-output.pdf", "wb") as outputStream:
        output.write(outputStream)  # 5
delete_pdf([2, 3, 4])
The numbered comments mark the key points in the code:
1. Create a writer instance that will hold the output PDF.
2. Read the local PDF document.
3. Get the number of pages in the PDF document.
4. Read page i of the PDF (0-based) and add it to the output instance.
5. Save the edited document locally.
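The same page-dropping logic can be wrapped in a reusable function. The following is a minimal sketch, not a definitive implementation. The names `pages_to_keep` and `delete_pages` are made up for this illustration, and the sketch assumes the newer `PdfReader`/`PdfWriter` names that PyPDF2 2.x and later provide in place of `PdfFileReader`/`PdfFileWriter`.

```python
def pages_to_keep(num_pages, delete):
    """Return the 0-based indices of the pages that survive.

    `delete` holds 1-based page numbers, as in delete_pdf above.
    """
    drop = set(delete)
    return [i for i in range(num_pages) if i + 1 not in drop]


def delete_pages(src_path, dst_path, delete):
    """Copy src_path to dst_path, skipping the pages listed in `delete`."""
    # Imported here so pages_to_keep works without PyPDF2 installed.
    # Assumption: PyPDF2 2.x or later, which provides PdfReader/PdfWriter.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for i in pages_to_keep(len(reader.pages), delete):
        writer.add_page(reader.pages[i])
    with open(dst_path, "wb") as fh:
        writer.write(fh)
```

Keeping the index arithmetic in its own function makes the 1-based versus 0-based conversion easy to check without touching any PDF file. For a six-page document, `pages_to_keep(6, [2, 3, 4])` returns `[0, 4, 5]`.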
Merge PDF
Now that deleting pages works, the next step is to merge pages from another PDF into the current one.
Method 1.
The page-deletion code above can be extended to merge PDFs.
from PyPDF2 import PdfFileWriter, PdfFileReader
output = PdfFileWriter()
input1 = PdfFileReader(open("example.pdf", "rb"))
input2 = PdfFileReader(open("sample2.pdf", "rb"))  # 1
def merge_pdf(add_index, origin_index):
    pages = input1.getNumPages()
    k = 0  # position in origin_index of the next page to insert
    for i in range(pages):
        if i + 1 in add_index:
            # insert a page of input2 in front of page i+1 of input1
            output.addPage(input2.getPage(origin_index[k]))  # 2
            k += 1
        output.addPage(input1.getPage(i))
    with open("PyPDF2-output.pdf", "wb") as outputStream:
        output.write(outputStream)
merge_pdf([2, 3, 4], [0, 0, 0])
1. Read the source document whose pages will be merged in.
2. When the loop reaches a specified page, insert the matching page of the source PDF in front of it.
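To see what order the pages end up in, it can help to compute the merge plan separately from the PDF handling. The sketch below is one possible way to do that. `merge_plan` and `merge_pdfs` are hypothetical names, each entry of `add_index` is paired with the entry of `origin_index` at the same position (which matches method 1 when `add_index` is sorted in ascending order), and the PDF part assumes the `PdfReader`/`PdfWriter` names of PyPDF2 2.x or later.

```python
def merge_plan(num_pages, add_index, origin_index):
    """Return the output order as (source, page_index) pairs.

    add_index: 1-based page numbers of input1 that get a page inserted
               in front of them.
    origin_index: 0-based page of input2 to insert at each position.
    """
    if len(add_index) != len(origin_index):
        raise ValueError("add_index and origin_index must have the same length")
    insert_before = dict(zip(add_index, origin_index))
    plan = []
    for i in range(num_pages):
        if i + 1 in insert_before:
            plan.append(("input2", insert_before[i + 1]))
        plan.append(("input1", i))
    return plan


def merge_pdfs(path1, path2, dst_path, add_index, origin_index):
    """Write path1 to dst_path with pages of path2 inserted per the plan."""
    # Assumption: PyPDF2 2.x or later; imported lazily so merge_plan
    # can be used and tested without the library.
    from PyPDF2 import PdfReader, PdfWriter

    readers = {"input1": PdfReader(path1), "input2": PdfReader(path2)}
    writer = PdfWriter()
    plan = merge_plan(len(readers["input1"].pages), add_index, origin_index)
    for source, page_index in plan:
        writer.add_page(readers[source].pages[page_index])
    with open(dst_path, "wb") as fh:
        writer.write(fh)
```

For a four-page input1, `merge_plan(4, [2, 3, 4], [0, 0, 0])` places the first page of input2 in front of pages 2, 3, and 4, which is the same order the loop in method 1 produces.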
Method 2.
Besides method 1, PyPDF2 offers a dedicated class for merging PDFs.
from PyPDF2 import PdfFileMerger  # 1
merger = PdfFileMerger()
input1 = open("document1.pdf", "rb")  # 2
input2 = open("document2.pdf", "rb")
input3 = open("document3.pdf", "rb")
merger.append(fileobj=input1, pages=(0, 3))  # 3
merger.merge(position=2, fileobj=input2, pages=(0, 1))  # 4
merger.append(input3)  # 5
with open("document-output.pdf", "wb") as output:
    merger.write(output)
1. Import PdfFileMerger, the merging class of PyPDF2.
2. Open the PDF documents that will be merged.
3. Take the first three pages of the first PDF document and append them to the output.
4. Insert the first page of the second PDF document at position 2 of the output.
5. Append the whole third PDF document to the end of the output.
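The effect of the `pages` and `position` arguments is easier to see on plain lists. The sketch below does not call PyPDF2 at all. It models each document as a list of page labels, under the assumption that a `pages` tuple selects `range(start, stop)` and that `position` is the 0-based index at which the pages are inserted. The helper names `select`, `append`, `merge`, and `demo` are made up for the illustration.

```python
def select(doc, pages=None):
    """Return all pages of doc, or those in range(*pages) if given."""
    if pages is None:
        return list(doc)
    return [doc[i] for i in range(*pages)]


def append(out, doc, pages=None):
    """Model of merger.append: add the selected pages at the end."""
    out.extend(select(doc, pages))


def merge(out, position, doc, pages=None):
    """Model of merger.merge: insert the selected pages at `position`."""
    out[position:position] = select(doc, pages)


def demo():
    doc1 = ["d1p0", "d1p1", "d1p2", "d1p3"]
    doc2 = ["d2p0", "d2p1"]
    doc3 = ["d3p0"]
    out = []
    append(out, doc1, pages=(0, 3))            # first three pages of doc1
    merge(out, 2, doc2, pages=(0, 1))          # first page of doc2 at index 2
    append(out, doc3)                          # all of doc3 at the end
    return out
```

Calling `demo()` returns `["d1p0", "d1p1", "d2p0", "d1p2", "d3p0"]`: the page from the second document lands between the second and third pages taken from the first.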
Besides the two main features described above, PyPDF2 also has some smaller ones.
Rotate
input1.getPage(1).rotateClockwise(90)
This rotates the second page of the document (page indices start at 0) by 90 degrees clockwise.
Add watermark
page = input1.getPage(3)
watermark = PdfFileReader(open("watermark.pdf", "rb"))
page.mergePage(watermark.getPage(0))
Here the watermark is stored in a separate PDF document, watermark.pdf, and is merged onto the fourth page (index 3).
Encrypt
password = "secret"
output.encrypt(password)
Choose a password, then call encrypt on the output document before writing it to disk.
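The three snippets above can be combined into one pass over a document. The following is only a sketch under stated assumptions: `normalize_rotation` and `decorate_pdf` are hypothetical names, and the code assumes the snake_case API of PyPDF2 2.x or later (`PdfReader`, `PdfWriter`, `page.rotate`, `page.merge_page`) in place of the camelCase calls shown above.

```python
def normalize_rotation(angle):
    """Reduce a clockwise angle to 0, 90, 180 or 270; reject anything else."""
    if angle % 90 != 0:
        raise ValueError("PDF pages can only be rotated in steps of 90 degrees")
    return angle % 360


def decorate_pdf(src_path, watermark_path, dst_path, angle=90, password="secret"):
    """Watermark and rotate every page of src_path, then encrypt the result."""
    # Assumption: PyPDF2 2.x or later; imported lazily so that
    # normalize_rotation can be used without the library.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    stamp = PdfReader(watermark_path).pages[0]
    writer = PdfWriter()
    for page in reader.pages:
        page.merge_page(stamp)                   # draw the watermark on the page
        page.rotate(normalize_rotation(angle))   # clockwise rotation
        writer.add_page(page)
    writer.encrypt(password)                     # set the password before write()
    with open(dst_path, "wb") as fh:
        writer.write(fh)
```

Validating the angle up front gives a clear error for values such as 45 instead of relying on how the library reacts to them.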
pdfminer
PyPDF2, introduced above, is good at page-level editing but weak at working with text and metadata.
Another Python library makes up for that shortcoming.
PDFMiner is a text extraction tool for PDF documents. It has the following features:
It obtains the exact location and layout of the text.
It converts PDF to other formats such as HTML and XML.
It extracts the table of contents.
It extracts tagged content.
It supports various font types (Type1, TrueType, Type3, and CID).
It supports Chinese, Japanese, and Korean text as well as vertical writing.
Installation
$ pip install pdfminer
PDF to TXT
The pdfminer project on GitHub ships several useful command-line tools in its tools directory, for example PDF to HTML and PDF to TXT. The following command prints the text of a PDF document:
$ pdf2txt.py samples/simple1.pdf
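The same conversion can also be done from Python code. The sketch below assumes the pdfminer.six fork (installed with `pip install pdfminer.six`), which provides `pdfminer.high_level.extract_text`. The helper names `txt_path_for` and `pdf_to_txt` are made up for this example.

```python
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the .txt path that sits next to the given PDF."""
    return Path(pdf_path).with_suffix(".txt")


def pdf_to_txt(pdf_path):
    """Extract the text of pdf_path and save it next to the PDF."""
    # Assumption: the pdfminer.six fork is installed; imported lazily
    # so that txt_path_for works without it.
    from pdfminer.high_level import extract_text

    out_path = txt_path_for(pdf_path)
    out_path.write_text(extract_text(pdf_path), encoding="utf-8")
    return out_path
```

With the sample file from the command above, `pdf_to_txt("samples/simple1.pdf")` would write its output to `samples/simple1.txt`.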
Summary
With these two Python libraries you can edit PDFs at every level, from pages to text and metadata. This article has only sketched the basic usage of each. For details and the full list of functions, read the official documentation or the project source code on GitHub. You can also build on these basics to explore more valuable applications, for example extracting the text and then calling a translation API to translate papers, or packaging the code into a general-purpose PDF editing tool.