PDF is a very common file format, typically used for formal electronic documents. It fixes the layout of a document so that it displays clearly and consistently everywhere. However, for anyone who wants to extract information from a PDF, especially tables, it can be a nightmare.
Many academic reports, papers, and analytical articles use PDF to present tabular data, but copying data directly out of those tables is often very difficult. Recently, a developer released a tool called Camelot that extracts tables from text-based PDFs. It can convert most tables directly into pandas DataFrames.
Project address: https://github.com/camelot-dev/camelot
What is Camelot?
According to the project introduction, Camelot is a Python tool used to extract table data from PDF files.
Specifically, users can read a PDF file much as they would read a CSV file with pandas, use Camelot to extract the tables, and then export them in a chosen format (such as a CSV file).
Code example
The project provides a sample PDF file, as shown in the image. Suppose the user needs to extract Table 2-1, which sits between paragraphs of text.
Using Camelot to extract table data, the code is as follows:
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf') # similar to opening a CSV file with Pandas
>>> tables[0].df # get a pandas DataFrame!
>>> tables.export('foo.csv', f='csv', compress=True) # export all tables; f can also be json, excel, html, sqlite
>>> tables[0].to_csv('foo.csv') # export one table; also to_json, to_excel, to_html, to_sqlite
>>> tables
<TableList n=1>
>>> tables[0]
<Table shape=(7, 7)> # the table's shape: 7 rows by 7 columns
>>> tables[0].parsing_report
{
'accuracy': 99.02,
'whitespace': 12.24,
'order': 1,
'page': 1
}
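The parsing report can also be used to screen tables before they are exported. The following is only a sketch: the helper name select_reliable and both thresholds are assumptions chosen for illustration, not part of Camelot, and the second report in the demo data is made up. It relies only on the 'accuracy' and 'whitespace' keys shown in the report above.

```python
from types import SimpleNamespace


def select_reliable(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Return the tables whose parsing report passes both thresholds.

    The thresholds are illustrative defaults, not values recommended by Camelot.
    """
    kept = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            kept.append(table)
    return kept


# Stand-ins for Camelot tables so the sketch runs without a PDF.
# With Camelot installed, pass the result of camelot.read_pdf(...) instead.
demo_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 41.50, "whitespace": 73.10, "order": 2, "page": 1}),  # made-up values
]

reliable = select_reliable(demo_tables)
print([t.parsing_report["order"] for t in reliable])
```

Filtering on the report first gives a cheap way to flag tables that probably need a manual check, rather than exporting everything blindly.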
The output is shown below. Where the source table has merged cells, Camelot leaves blanks in the extracted table, which is a reliable way to handle them.
Installation method
The project author provides three installation methods. The first, and the simplest, is Conda.
conda install -c conda-forge camelot-py
The most popular installation method is using pip.
pip install "camelot-py[cv]"
You can also clone the repository and install from source.
git clone https://www.github.com/camelot-dev/camelot
cd camelot
pip install ".[cv]"
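Whichever method is used, it can help to confirm that the package is visible to the interpreter you plan to run. Note that the package is installed as camelot-py but imported as camelot. The helper below is a minimal sketch using only the standard library; the name is_installed is an assumption for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


# The import name is "camelot", even though the package is installed as camelot-py.
if is_installed("camelot"):
    print("camelot is importable")
else:
    print("camelot was not found in this environment")
```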