What is PDFGrid
PDFGrid is a ongoing project that extracts automatically grid-like or tabular data from a text based PDF. The algorithm is made so that in a later phase the output of a OCR can be used to detect grid-like data. The output of PDFGrid can be given to the DSL Parser to further clean, transform, merge etc, the data.
Since the project is still a play ground to test several ideas, the code is not clean. Therefore, the code is undergoing a major clean up/upgrade, but the idea’s behind the extraction are the same.
Example
This pdf contains a fairly easy extraction.
This is a part of the extracted data. Note that the line with DEC is not part of the output. This is because the algorithm considers the line NOV and DEC too different. Future evolutions of the project will improve on that.
>>-------------
JAN(P) |8,631 |14,714 |7,884 |907,261 |907,134 |1,253,068 |72.4 |6,099.0
FEB(P) |7,858 |11,130 |7,069 |804,577 |796,192 |1,123,492 |70.9 |5,715.0
MAR(P) |8,624 |14,518 |7,758 |907,432 |912,081 |1,232,963 |74.0 |5,903.0
APR(P) |8,331 |13,998 |7,670 |852,743 |862,694 |1,219,041 |70.8 |5,838.0
MAY(P) |8,648 |14,547 |8,039 |933,573 |948,498 |1,277,627 |74.2 |6,134.0
JUNE(P) |8,279 |14,232 |7,367 |809,681 |830,542 |1,170,875 |70.9 |6,165.0
JULY(R) |8,562 |14,860 |7,725 |803,943 |822,897 |1,227,813 |67.0 |7,234.0
AUG(P) |8,547 |14,666 |7,756 |908,224 |911,328 |1,232,691 |73.9 |6,502.0
SEP(P) |8,273 |14,069 |7,440 |809,330 |811,833 |1,182,505 |68.7 |7,023.0
OCT |8,609 |14,585 |7,657 |799,041 |805,016 |1,216,975 |66.1 |7,373.0
NOV |8,498 |14,643 |7,492 |861,128 |849,822 |1,204,984 |70.5 |6,793.0
<<-------------
How does it work
The extraction is only based on the position of the characters. This approach has advantages (extract from layout without grid lines) as well as disadvantages (see previous examples, where grid lines could give additional information).
Text extraction
The characters and their positions are extracted from the pdf file, using to a slightly modified apache pdfbox library. The result is a list of characters and their bounding boxes.
325.37363 335.49146 175.48999 172.39459 (%)
664.06354 674.18134 175.48999 172.39459 (%)
65.64 87.380394 188.93 185.8346 JAN(P)
105.14 121.75882 188.81 185.7146 8,631
135.98 156.3146 188.81 185.7146 14,714
(For the future, I hope to process pdf/images as well by using an OCR that delivers the similar output)
Horizontal alignment
The elements are sorted on horizontal position to determine the lines.
37 554.5699 576.21515 175.01001 171.91461 ----------
37 664.06354 674.18134 175.48999 172.39459 (%)
38 65.64 87.380394 188.93 185.8346 JAN(P)
38 105.14 121.75882 188.81 185.7146 8,631
38 135.98 156.3146 188.81 185.7146 14,714
Grouping characters
Within each line, the groups of characters are sorted on vertical position. Based on the font type, the width of the white space separator is determined. Based on the white space width, characters are grouped. There is a special case where the white space is used as thousand separator (example 1 234 567). The grouping does not have to be perfect, but can solve a number of edge cases in later processing.
Similar lines
The main idea of the grid extraction is based on the similarity of consecutive lines. A key metric of the similarity is based on the vertical aligments (left, center and right) of words.
Vertical Alignment
The grouped characters are ‘clustered’ according the left alignment (simply by taking the left point of the bounding rectangle of the group chars). Within a given cluster, the grouped characters are overlapping, but 2 different clusters do not overlap at all.
This process is repeated for the right and center aligned clusters. The three clusters are merged into one and a ‘alignment matrix’ is constructed. This matrix will be used as metric for the similarity.
Similarity Between 2 Lines
The similarity is based on several metrics
- Number of white spans: tabular data tends to have the same number of white spans
- Number of aligned white spans: tabular data tends to have the a large number of aligned white spans
- Font Similarity
- Relative alignments: this metric is coming the alignment matrix. If the line belongs to a ‘large’ group of aligned words, then the metric will be important. (This metric takes more than the 2 lines into account)
- Pixels: relative increase of ‘black’ space.
This results in a single number that determines if 2 lines are similar.