andrewji8

Python-Camelot: Extract PDF table data in three lines of code.

PDF is a very common file format, typically used for formal electronic documents. It fixes the page layout so that a document looks the same wherever it is opened, which gives a clear and visually appealing result. However, for anyone who wants to extract information from PDFs, especially tables, it can be a nightmare.

A large number of academic reports, papers, and analytical articles use PDF to present tabular data, but copying data directly out of those tables can be very difficult. Recently, a developer released a tool called Camelot that can extract tables from text-based PDFs. It can convert most tables directly into pandas DataFrames.

Project address: https://github.com/camelot-dev/camelot

What is Camelot?

According to the project introduction, Camelot is a Python tool used to extract table data from PDF files.

Specifically, users can read a PDF file much as they would read a CSV file with pandas, use Camelot to extract the table data, and then specify an output format (such as CSV).

Code example

The project provides a sample PDF file. Suppose the user needs to extract Table 2-1, which sits between blocks of body text.

The code to extract the table data with Camelot is as follows:

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')  # similar to opening a CSV file with pandas
>>> tables[0].df  # get a pandas DataFrame!
>>> tables.export('foo.csv', f='csv', compress=True)  # export all tables; f sets the format: csv, json, excel, html, sqlite
>>> tables[0].to_csv('foo.csv')  # export one table; also to_json, to_excel, to_html, to_sqlite
>>> tables
<TableList n=1>
>>> tables[0]
<Table shape=(7, 7)>  # the table has 7 rows and 7 columns
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
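
The parsing report can also be used to decide which extracted tables are worth keeping. The following is a minimal sketch, not part of the Camelot project: the helper names select_reliable and extract_reliable_tables and both threshold values are assumptions chosen for illustration, and the second sample report is made up.

```python
def select_reliable(reports, min_accuracy=90.0, max_whitespace=50.0):
    """Return the indices of parsing reports that pass both thresholds."""
    return [
        index
        for index, report in enumerate(reports)
        if report["accuracy"] >= min_accuracy
        and report["whitespace"] <= max_whitespace
    ]


def extract_reliable_tables(pdf_path, min_accuracy=90.0, max_whitespace=50.0):
    """Read a PDF with Camelot and keep only tables whose report passes."""
    import camelot  # imported here so select_reliable works without Camelot

    tables = camelot.read_pdf(pdf_path)
    reports = [tables[i].parsing_report for i in range(len(tables))]
    keep = select_reliable(reports, min_accuracy, max_whitespace)
    return [tables[i].df for i in keep]


# Illustrative reports: the first repeats the values from the listing above,
# the second is hypothetical and shows a table that would be rejected.
sample_reports = [
    {"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1},
    {"accuracy": 41.50, "whitespace": 73.00, "order": 2, "page": 1},
]
print(select_reliable(sample_reports))  # prints [0]
```

Calling extract_reliable_tables('foo.pdf') would then return a list of DataFrames for the tables that pass. Suitable thresholds depend on the documents being processed.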

In the extracted output, Camelot handles merged cells by leaving blank entries in the positions the merged cell spanned, which keeps the table's row and column structure intact.
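
If those blank entries get in the way of later analysis, they can be filled in after extraction. The sketch below is one possible post-processing step, not something Camelot does itself. It assumes the merged cell spanned rows vertically and that its text ended up in the topmost row. The function name forward_fill_blanks and the sample rows are hypothetical.

```python
def forward_fill_blanks(rows, column):
    """Copy the last non-blank value in `column` down into blank cells.

    `rows` is a list of lists of strings, for example the result of
    tables[0].df.values.tolist() on an extracted table.
    The input rows are not modified; a new list is returned.
    """
    filled = []
    last_seen = ""
    for row in rows:
        new_row = list(row)
        if new_row[column].strip():
            last_seen = new_row[column]
        else:
            new_row[column] = last_seen
        filled.append(new_row)
    return filled


# Hypothetical rows standing in for an extracted table whose first
# column contained vertically merged cells.
rows = [
    ["North", "2019", "10"],
    ["", "2020", "12"],
    ["South", "2019", "7"],
    ["", "2020", "9"],
]
filled = forward_fill_blanks(rows, column=0)
for row in filled:
    print(row)
```

Filling only the chosen column avoids overwriting cells that were genuinely empty in the source table.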

Installation method

The project author provides three installation methods. The simplest is to install with Conda.

conda install -c conda-forge camelot-py

The most popular method is pip. Quoting the package name keeps shells such as zsh from interpreting the square brackets.

pip install "camelot-py[cv]"

You can also clone the project repository and install from source.

git clone https://www.github.com/camelot-dev/camelot
cd camelot
pip install ".[cv]"
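
After any of these methods, a quick check can confirm that the package is visible to the current Python interpreter. This is a small sketch using only the standard library. The function name camelot_install_status is made up for illustration, and the check assumes the distribution is named camelot-py while the importable module is named camelot, as in the commands and listing above.

```python
import importlib.util
from importlib import metadata


def camelot_install_status():
    """Report whether the camelot module can be found, without importing it."""
    if importlib.util.find_spec("camelot") is None:
        return "camelot is not installed in this environment"
    try:
        version = metadata.version("camelot-py")
    except metadata.PackageNotFoundError:
        # The module exists but was not installed under the expected name.
        version = "unknown version"
    return f"camelot-py ({version}) is installed"


print(camelot_install_status())
```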
