Technical Articles

PDF To Grid

21 Apr 2014

What is PDFGrid

PDFGrid is a ongoing project that extracts automatically grid-like or tabular data from a text based PDF. The algorithm is made so that in a later phase the output of a OCR can be used to detect grid-like data. The output of PDFGrid can be given to the DSL Parser to further clean, transform, merge etc, the data.

Since the project is still a play ground to test several ideas, the code is not clean. Therefore, the code is undergoing a major clean up/upgrade, but the idea’s behind the extraction are the same.

Example

This pdf contains a fairly easy extraction.

alt text

This is a part of the extracted data. Note that the line with DEC is not part of the output. This is because the algorithm considers the line NOV and DEC too different. Future evolutions of the project will improve on that.

>>-------------
JAN(P)	|8,631	|14,714	|7,884	|907,261  |907,134 |1,253,068	|72.4	|6,099.0
FEB(P)	|7,858	|11,130	|7,069	|804,577  |796,192 |1,123,492	|70.9	|5,715.0
MAR(P)	|8,624	|14,518	|7,758	|907,432  |912,081 |1,232,963	|74.0	|5,903.0
APR(P)	|8,331	|13,998	|7,670	|852,743  |862,694 |1,219,041	|70.8	|5,838.0
MAY(P)	|8,648	|14,547	|8,039	|933,573  |948,498 |1,277,627	|74.2	|6,134.0
JUNE(P)	|8,279	|14,232	|7,367	|809,681  |830,542 |1,170,875	|70.9	|6,165.0
JULY(R)	|8,562	|14,860	|7,725	|803,943  |822,897 |1,227,813	|67.0	|7,234.0
AUG(P)	|8,547	|14,666	|7,756	|908,224  |911,328 |1,232,691	|73.9	|6,502.0
SEP(P)	|8,273	|14,069	|7,440	|809,330  |811,833 |1,182,505	|68.7	|7,023.0
OCT	|8,609	|14,585	|7,657	|799,041  |805,016 |1,216,975	|66.1	|7,373.0
NOV	|8,498	|14,643	|7,492	|861,128  |849,822 |1,204,984	|70.5	|6,793.0
<<-------------

How does it work

The extraction is only based on the position of the characters. This approach has advantages (extract from layout without grid lines) as well as disadvantages (see previous examples, where grid lines could give additional information).

Text extraction

The characters and their positions are extracted from the pdf file, using to a slightly modified apache pdfbox library. The result is a list of characters and their bounding boxes.

325.37363	335.49146	175.48999	172.39459	(%)
664.06354	674.18134	175.48999	172.39459	(%)
65.64	        87.380394	188.93	        185.8346	JAN(P)
105.14	        121.75882	188.81	        185.7146	8,631
135.98	        156.3146	188.81	        185.7146	14,714

(For the future, I hope to process pdf/images as well by using an OCR that delivers the similar output)

Horizontal alignment

The elements are sorted on horizontal position to determine the lines.

37	554.5699	576.21515	175.01001	171.91461	----------
37	664.06354	674.18134	175.48999	172.39459	(%)
38	65.64	        87.380394	188.93	        185.8346	JAN(P)
38	105.14	        121.75882	188.81	        185.7146	8,631
38	135.98	        156.3146	188.81	        185.7146	14,714

Grouping characters

Within each line, the groups of characters are sorted on vertical position. Based on the font type, the width of the white space separator is determined. Based on the white space width, characters are grouped. There is a special case where the white space is used as thousand separator (example 1 234 567). The grouping does not have to be perfect, but can solve a number of edge cases in later processing.

Similar lines

The main idea of the grid extraction is based on the similarity of consecutive lines. A key metric of the similarity is based on the vertical aligments (left, center and right) of words.

Vertical Alignment

The grouped characters are ‘clustered’ according the left alignment (simply by taking the left point of the bounding rectangle of the group chars). Within a given cluster, the grouped characters are overlapping, but 2 different clusters do not overlap at all.

This process is repeated for the right and center aligned clusters. The three clusters are merged into one and a ‘alignment matrix’ is constructed. This matrix will be used as metric for the similarity.

Similarity Between 2 Lines

The similarity is based on several metrics

This results in a single number that determines if 2 lines are similar.