After refreshing Tika's regression corpus (see CommonCrawl3 and TIKA-2750), we thought it might be interesting to run a comparison between the text extracted with pdftotext and Tika/PDFBox. Because pdftotext does not extract content from embedded files, does not perform Optical Character Recognition (OCR), and does not offer OCR integration, this evaluation focused only on extracting electronic text as stored within PDFs. The goals of this study include:
The reader should not read the following as a recommendation for one tool over another.
The tika-eval reports and the full H2 database of comparison results are available here: http://162.242.228.174/pdf_parsing/pdftotextVPDFBox_201811/.
NOTE: On December 11, 2018, I reran the analysis with pdftotext's command-line option -cfg to specify the (font) configuration file. During the first run and initial analysis, pdftotext was not applying the custom fonts even though the xpdfrc file was placed in /usr/local/etc/xpdfrc. This wiki has been updated to reflect the slight differences in the re-run of the evaluation. In general, based on the common-words metric, it appears that text extraction improved for Japanese and Chinese PDFs when we used the -cfg option to point pdftotext at the xpdfrc config file that references the language-specific extra font files.
There were 58,077 empty files generated by our wrapper of pdftotext. Because pdftotext respects access permissions, up to 51,290 of these may represent runtime exceptions; seven files caused timeouts. Some content (number of tokens > 0) was extracted from 400,782 files, and 373,569 extracts contained >= 100 tokens.
Aside from the "permission exceptions", 38,158 files caused a runtime exception in PDFBox. Some content (number of tokens > 0) was extracted from 428,706 files. Note that for this evaluation we used a standard default content handler that appends the title to the extracted content. With that handler, 384,277 extracts contained >= 100 tokens.
In the absence of ground truth, tika-eval counts the number of "common words" (see: Common Words) per file as a rough proxy for extraction quality. The extracts generated by pdftotext contained 612,693,477 "common words", and the extracts from Tika/PDFBox contained 615,269,581. By this metric, that would be a 0.42% increase in the extraction of common tokens if one moved from pdftotext to Tika/PDFBox.
In manually reviewing the results, we noticed that Tika/PDFBox apparently performed far better when there was a disagreement in language id (see the section below on Languages). We also noticed that, as the content of the two extracts became more similar, Tika/PDFBox's extracts more often showed room for improvement in spacing than pdftotext's did. Based on this observation, we calculated the number of common tokens extracted by each tool when the language id of both extracts was the same _or_ when there was content for only one of the extracts (but not the other) \[SQL1\] and \[SQL2\]. Under this restriction, the difference narrows even further: pdftotext extracts 607,187,545 common tokens, whereas PDFBox/Tika extracts 607,324,140 (a 0.02% increase).
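The percentages quoted above are simply relative changes in total common-token counts; a minimal sketch of that arithmetic, with the totals copied from this report:

```python
# Relative change in total "common words" when moving from
# pdftotext (A) to Tika/PDFBox (B); totals copied from this report.
def pct_change(a: int, b: int) -> float:
    """Percent change from a to b."""
    return 100.0 * (b - a) / a

# All files:
print(f"{pct_change(612_693_477, 615_269_581):.2f}%")  # -> 0.42%
# Restricted to files where the language id agrees (or only one
# extract has content):
print(f"{pct_change(607_187_545, 607_324_140):.2f}%")  # -> 0.02%
```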
Bottom line: based on this analysis so far, the differences between the tools for extracting electronic text are not large. It appears that PDFBox extracts slightly more content. However, the reports and this preliminary analysis point to areas for improvement in both tools and in tika-eval.
In the following, we show the top 20 languages identified in the extracted text. The tika-eval module uses the optimaize language detector (version 0.6) for language identification. The language codes are roughly ISO 639-1.
The first language is the one identified in the pdftotext extract, and the second is the one identified in the Tika/PDFBox extract. For example, 'en->fa' means that language id returned 'en' on the pdftotext extract but 'fa' on the Tika/PDFBox extract.
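These pair labels can be thought of as a simple tally over the per-file language ids of the two extracts. A sketch of how such a tally might be built (the tuples here are illustrative stand-ins for the lang_id_1 columns of the A and B tables):

```python
from collections import Counter

def tally_lang_pairs(pairs):
    """Count 'langA->langB' pairs; (None, None) means neither
    extract had enough text for a language id ('null' in the table)."""
    counts = Counter()
    for lang_a, lang_b in pairs:
        if lang_a is None and lang_b is None:
            counts["null"] += 1
        else:
            counts[f"{lang_a}->{lang_b}"] += 1
    return counts

example = [("en", "en"), ("en", "fa"), ("en", "en"), (None, None)]
print(tally_lang_pairs(example).most_common(2))
```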
| Language id | Number of Files |
|-------------|-----------------|
| en->en | 143,784 |
| ru->ru | 44,460 |
| fr->fr | 38,872 |
| it->it | 36,433 |
| de->de | 30,151 |
| es->es | 18,335 |
| ja->ja | 16,106 |
| el->el | 9,761 |
| fa->fa | 8,486 |
| ko->ko | 8,213 |
| zh-cn->zh-cn | 5,815 |
| tr->tr | 5,477 |
| null | 3,132 |
| vi->vi | 2,981 |
| he->he | 2,280 |
| ar->ar | 2,087 |
| ca->ca | 1,275 |
| en->fa | 1,240 |
| pt->pt | 1,105 |
| de->en | 860 |
In the following, we show the top 10 language id pairs, where the language id differs between the extracts.
| Language ids | Number of Files |
|--------------|-----------------|
| en->fa | 1,240 |
| de->en | 860 |
| en->de | 519 |
| en->bn | 392 |
| ar->fa | 391 |
| it->en | 209 |
| en->it | 209 |
| fr->en | 201 |
| en->fr | 149 |
| br->bn | 146 |
In the following table, we present the number of "common words" extracted per language id \[SQL3\]. Note that this language "id" is based on the extracted text per tool, _not_ ground truth; some care must be taken in interpreting these data.
| Language Id | pdftotext | Tika/PDFBox | % Change |
|-------------|-----------|-------------|----------|
| null | 5,729 | 2,757 | -51.9% |
| af | 10,513 | 11,192 | 6.5% |
| an | 113,326 | 88,633 | -21.8% |
| ar | 2,272,891 | 1,832,365 | -19.4% |
| ast | 15,507 | 16,927 | 9.2% |
| be | 34 | 74 | 117.6% |
| bg | 22,582 | 25,887 | 14.6% |
| bn | 5,648 | 201,707 | 3471.3% |
| br | 21,470 | 21,282 | -0.9% |
| ca | 301,284 | 307,474 | 2.1% |
| cs | 26,933 | 32,126 | 19.3% |
| cy | 72,368 | 70,987 | -1.9% |
| da | 24,702 | 29,576 | 19.7% |
| de | 30,132,564 | 31,337,406 | 4.0% |
| el | 9,432,310 | 9,572,594 | 1.5% |
| en | 245,298,469 | 256,492,056 | 4.6% |
| es | 39,476,601 | 40,478,683 | 2.5% |
| et | 38,174 | 23,226 | -39.2% |
| eu | 17,771 | 14,476 | -18.5% |
| fa | 19,376,868 | 21,751,814 | 12.3% |
| fi | 10,195 | 10,628 | 4.2% |
| fr | 61,980,431 | 65,709,586 | 6.0% |
| ga | 28,809 | 25,165 | -12.6% |
| gl | 206,752 | 220,055 | 6.4% |
| gu | 3,456 | 3,840 | 11.1% |
| he | 3,389,105 | 3,358,894 | -0.9% |
| hi | 266,969 | 264,400 | -1.0% |
| hr | 30,780 | 30,632 | -0.5% |
| ht | 8,498 | 3,570 | -58.0% |
| hu | 13,283 | 15,661 | 17.9% |
| id | 215,824 | 193,661 | -10.3% |
| is | 14,480 | 13,250 | -8.5% |
| it | 45,985,473 | 47,562,781 | 3.4% |
| ja | 45,187,665 | 46,921,042 | 3.8% |
| km | 5 | 9 | 80.0% |
| kn | 4,090 | 4,285 | 4.8% |
| ko | 4,908,528 | 4,990,836 | 1.7% |
| lt | 5,766 | 5,822 | 1.0% |
| lv | 12,467 | 10,453 | -16.2% |
| mk | 547 | 1,400 | 155.9% |
| ml | 1,281 | 1,280 | -0.1% |
| mr | 22,775 | 23,452 | 3.0% |
| ms | 234,536 | 286,100 | 22.0% |
| mt | 39,416 | 29,780 | -24.4% |
| ne | 76 | 234 | 207.9% |
| nl | 563,761 | 584,303 | 3.6% |
| no | 55,053 | 56,548 | 2.7% |
| oc | 838 | 5,017 | 498.7% |
| pa | 79 | 107 | 35.4% |
| pl | 51,976 | 55,890 | 7.5% |
| pt | 2,196,276 | 2,395,862 | 9.1% |
| ro | 40,358 | 32,007 | -20.7% |
| ru | 79,309,911 | 81,574,518 | 2.9% |
| sk | 9,131 | 7,760 | -15.0% |
| sl | 9,243 | 13,665 | 47.8% |
| so | 281,724 | 350,402 | 24.4% |
| sq | 3,089 | 5,997 | 94.1% |
| sr | 690 | 707 | 2.5% |
| sv | 56,956 | 69,112 | 21.3% |
| sw | 1,180 | 1,003 | -15.0% |
| ta | 1,308 | 1,303 | -0.4% |
| te | 3,360 | 5,824 | 73.3% |
| th | 5,673 | 8,675 | 52.9% |
| tl | 1,277 | 2,987 | 133.9% |
| tr | 878,127 | 917,616 | 4.5% |
| uk | 3,977 | 5,910 | 48.6% |
| ur | 29,711 | 9,575 | -67.8% |
| vi | 2,310,556 | 2,455,051 | 6.3% |
| yi | 28 | 32 | 14.3% |
| zh-cn | 16,938,972 | 18,231,035 | 7.6% |
| zh-tw | 703,272 | 646,268 | -8.1% |
When we require that both extracts for a given file have the same language id, we see some different patterns \[SQL4\].
| Language Id | pdftotext | Tika/PDFBox | % Change |
|-------------|-----------|-------------|----------|
| af | 10,274 | 10,330 | 0.5% |
| an | 68,416 | 63,324 | -7.4% |
| ar | 2,021,212 | 1,652,127 | -18.3% |
| ast | 9,931 | 10,009 | 0.8% |
| be | 33 | 34 | 3.0% |
| bg | 20,522 | 20,549 | 0.1% |
| bn | 4,556 | 4,754 | 4.3% |
| br | 7,207 | 7,340 | 1.8% |
| ca | 242,815 | 246,346 | 1.5% |
| cs | 24,872 | 25,297 | 1.7% |
| cy | 51,644 | 51,330 | -0.6% |
| da | 22,737 | 23,412 | 3.0% |
| de | 29,478,164 | 29,814,286 | 1.1% |
| el | 9,427,793 | 9,371,790 | -0.6% |
| en | 243,539,028 | 244,571,204 | 0.4% |
| es | 38,876,299 | 38,835,666 | -0.1% |
| et | 16,080 | 16,516 | 2.7% |
| eu | 11,018 | 11,127 | 1.0% |
| fa | 19,278,448 | 17,596,803 | -8.7% |
| fi | 9,767 | 9,913 | 1.5% |
| fr | 61,593,227 | 62,039,806 | 0.7% |
| ga | 22,256 | 22,131 | -0.6% |
| gl | 174,194 | 166,007 | -4.7% |
| gu | 3,456 | 3,633 | 5.1% |
| he | 3,378,638 | 3,206,568 | -5.1% |
| hi | 265,152 | 242,522 | -8.5% |
| hr | 30,124 | 30,354 | 0.8% |
| ht | 2,798 | 2,869 | 2.5% |
| hu | 13,260 | 13,595 | 2.5% |
| id | 190,295 | 190,072 | -0.1% |
| is | 10,718 | 10,777 | 0.6% |
| it | 45,675,112 | 45,575,218 | -0.2% |
| ja | 45,046,240 | 45,903,641 | 1.9% |
| km | 5 | 9 | 80.0% |
| kn | 3,938 | 3,950 | 0.3% |
| ko | 4,864,390 | 4,914,251 | 1.0% |
| lt | 5,587 | 5,684 | 1.7% |
| lv | 10,786 | 10,435 | -3.3% |
| mk | 545 | 1,398 | 156.5% |
| ml | 1,281 | 1,280 | -0.1% |
| mr | 22,695 | 22,523 | -0.8% |
| ms | 221,213 | 226,932 | 2.6% |
| mt | 18,253 | 18,777 | 2.9% |
| ne | 73 | 83 | 13.7% |
| nl | 548,538 | 552,625 | 0.7% |
| no | 41,588 | 42,482 | 2.1% |
| oc | 605 | 609 | 0.7% |
| pa | 79 | 107 | 35.4% |
| pl | 50,848 | 51,776 | 1.8% |
| pt | 2,090,127 | 2,144,491 | 2.6% |
| ro | 30,282 | 30,889 | 2.0% |
| ru | 79,195,319 | 78,271,782 | -1.2% |
| sk | 8,745 | 6,776 | -22.5% |
| sl | 8,290 | 8,519 | 2.8% |
| so | 224,340 | 212,438 | -5.3% |
| sq | 2,882 | 4,269 | 48.1% |
| sr | 689 | 703 | 2.0% |
| sv | 40,347 | 41,313 | 2.4% |
| sw | 959 | 951 | -0.8% |
| ta | 1,308 | 1,303 | -0.4% |
| te | 3,360 | 3,407 | 1.4% |
| th | 5,078 | 5,088 | 0.2% |
| tl | 1,175 | 1,207 | 2.7% |
| tr | 865,494 | 878,297 | 1.5% |
| uk | 3,898 | 5,153 | 32.2% |
| ur | 21,459 | 5,553 | -74.1% |
| vi | 2,254,738 | 2,264,384 | 0.4% |
| yi | 28 | 32 | 14.3% |
| zh-cn | 16,608,321 | 17,263,436 | 3.9% |
| zh-tw | 347,067 | 349,029 | 0.6% |
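The % Change column in these tables is the relative difference between the two tools' per-language common-token totals. A sketch of that computation, using the ur row from the same-language-id table as the worked example:

```python
# Per-language percent change from pdftotext (A) to Tika/PDFBox (B),
# as reported in the "% Change" column, rounded to one decimal place.
def pct_change_by_lang(totals_a: dict, totals_b: dict) -> dict:
    return {
        lang: round(100.0 * (totals_b[lang] - a) / a, 1)
        for lang, a in totals_a.items()
        if a > 0 and lang in totals_b
    }

# The 'ur' row, where both extracts were identified as Urdu:
a = {"ur": 21_459}
b = {"ur": 5_553}
print(pct_change_by_lang(a, b))  # -> {'ur': -74.1}
```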
Further evaluation and analysis are required. Most importantly, we need to determine whether any of the above areas of inquiry stem from faults in tika-eval that should be fixed.
I ran \[SQL5\] and manually reviewed the results.
My conclusion is that most of the documents that received a language id of bn actually contain a high percentage of junk.
Recommendations:
I don't know these languages, but I ran \[SQL7\] and then put the contents of _TOP_10_UNIQUE_TOKEN_DIFFS_A_ and _TOP_10_UNIQUE_TOKEN_DIFFS_B_ through Google Translate. For example, pdftotext's top 10 unique words in _commoncrawl3_refetched/XH/XHYIWIBT5QPY64UYUPLXZXAYC2I5JPZS_:
ميں: 532 | ہے: 520 | كے: 450 | ہيں: 370 | كہ: 365 | كو: 343 | سے: 342 | كا: 297 | ہم: 280 | جناب: 254 |
are translated as:
I: 532 | Is: 520 | Of: 450 | Are: 370 | Yes: 365 | Who: 343 | From: 342 | : 297 | We: 280 | Mr.: 254 |
Whereas PDFBox/Tika's unique tokens
ںيم: 564 | ےہ: 537 | ےك: 468 | ںيہ: 386 | ہك: 365 | وك: 360 | ےس: 348 | اك: 306 | مہ: 281 | انجب: 250 |
are translated as:
Th: 564 | Yes: 537 | S: 468 | Yes: 386 | Hak: 365 | Ki: 360 | S: 348 | A: 306 | Mah: 281 | Ingredients: 250 |
Overall, this method did not yield satisfactory insight into general patterns: in some cases the individual terms looked better in one tool, and in other cases the reverse. I did note that PDFBox's extracted text contained more cases of numerals concatenated with words, as in commoncrawl3/JG/JGE6WTYI5SEI3Z4JUULIPSSRTNL3VMIG:
TOP_10_UNIQUE_TOKEN_DIFFS_A
1: 167 | رياضي: 167 | 9: 44 | 8: 38 | 7: 28 | 6: 16 | 5: 9 | 4: 6 | 3: 2 | 9622243 |
TOP_10_UNIQUE_TOKEN_DIFFS_B
رياضي: 44 | 8رياضي: 38 | 7رياضي: 28 | 10رياضي: 24 | 6رياضي: 16 | 5رياضي: 9 | 4رياضي: 6 | 3رياضي: 2 | 96222431: 1 |
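One cheap way to surface the numeral+word concatenations illustrated above is a regex over the token lists. A sketch (the sample tokens are taken from the extract above; the helper name is ours):

```python
import re

# Flag tokens where a digit is fused directly onto a letter (or vice
# versa), e.g. a page or footnote number concatenated with an adjacent
# word. [^\W\d_] matches a Unicode letter, so this works for Arabic
# script as well as Latin.
FUSED = re.compile(r"\d[^\W\d_]|[^\W\d_]\d")

def fused_tokens(tokens):
    return [t for t in tokens if FUSED.search(t)]

sample = ["رياضي", "8رياضي", "7رياضي", "96222431"]
print(fused_tokens(sample))  # -> ['8رياضي', '7رياضي']
```

Pure-digit tokens and pure-letter tokens are left alone; only mixed runs are flagged, which is exactly the pattern seen in the TOP_10_UNIQUE_TOKEN_DIFFS_B list above.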
\[SQL1\]

```sql
select sum(cb.num_common_tokens)
from contents_b cb
join profiles_b pb on pb.id = cb.id
left join profiles_a pa on pb.id = pa.id
left join contents_a ca on pa.id = ca.id
where pa.is_embedded = false
  and pb.is_embedded = false
  and (ca.lang_id_1 = cb.lang_id_1 or ca.lang_id_1 is null)
```
\[SQL2\]

```sql
select sum(ca.num_common_tokens)
from contents_a ca
join profiles_a pa on pa.id = ca.id
left join profiles_b pb on pa.id = pb.id
left join contents_b cb on pb.id = cb.id
where pa.is_embedded = false
  and pb.is_embedded = false
  and (cb.lang_id_1 = ca.lang_id_1 or cb.lang_id_1 is null)
```
\[SQL3\]

```sql
select lang_id_1, sum(num_common_tokens) as total_common_tokens
from contents_b
group by lang_id_1
order by lang_id_1
```
\[SQL4\]

```sql
select ca.lang_id_1, sum(ca.num_common_tokens)
from contents_a ca
join contents_b cb on ca.id = cb.id
where ca.lang_id_1 = cb.lang_id_1
group by ca.lang_id_1
order by ca.lang_id_1
```
\[SQL5\]

```sql
select ca.lang_id_1, ca.top_n_tokens, cb.top_n_tokens
from contents_b cb
join contents_a ca on cb.id = ca.id
where cb.lang_id_1 = 'bn'
order by rand()
limit 100;
```
\[SQL6\]

```sql
select ca.id, file_path,
       1 - (cast(ca.num_common_tokens as float) / cast(ca.num_alphabetic_tokens as float)) as OOV_A,
       ca.num_alphabetic_tokens,
       1 - (cast(cb.num_common_tokens as float) / cast(cb.num_alphabetic_tokens as float)) as OOV_B,
       cb.num_alphabetic_tokens,
       ca.lang_id_1, ca.lang_id_prob_1, cb.lang_id_1, cb.lang_id_prob_1,
       ca.top_n_tokens, cb.top_n_tokens
from contents_b cb
join contents_a ca on cb.id = ca.id
join profiles_a pa on ca.id = pa.id
join containers c on pa.container_id = c.container_id
where cb.lang_id_1 = 'bn'
  and ca.num_alphabetic_tokens > 0
  and cb.num_alphabetic_tokens > 0
order by OOV_B asc
limit 100;
```
\[SQL7\] (the original query reused the alias cc for both content_comparisons and containers, which is ambiguous; the containers alias is renamed here)

```sql
select file_path, ca.top_n_tokens, cb.top_n_tokens,
       (cb.num_common_tokens - ca.num_common_tokens) as delta_common_tokens,
       top_10_unique_token_diffs_a, top_10_unique_token_diffs_b
from contents_a ca
join contents_b cb on ca.id = cb.id
join content_comparisons cmp on cmp.id = ca.id
join profiles_a pa on ca.id = pa.id
join containers c on pa.container_id = c.container_id
where ca.lang_id_1 = 'ur' and cb.lang_id_1 = 'ur'
order by delta_common_tokens asc
```
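\[SQL6\] derives an out-of-vocabulary (OOV) rate as one minus the ratio of common tokens to alphabetic tokens. The same quantity in a minimal Python sketch (the function name and sample counts are illustrative):

```python
# Out-of-vocabulary rate as computed in [SQL6]:
#   OOV = 1 - (num_common_tokens / num_alphabetic_tokens)
# A higher OOV rate suggests more junk/garbled tokens in an extract.
def oov_rate(num_common_tokens: int, num_alphabetic_tokens: int) -> float:
    if num_alphabetic_tokens <= 0:
        raise ValueError("requires num_alphabetic_tokens > 0")
    return 1.0 - num_common_tokens / num_alphabetic_tokens

# 80 of 100 alphabetic tokens found in the common-words list:
print(f"{oov_rate(80, 100):.2f}")  # -> 0.20
```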
Exceptions aside, the critical file is content/content_diffs_with_exceptions.xlsx. This shows differences in the content that was extracted. Column TOP_10_UNIQUE_TOKEN_DIFFS_A records the top 10 most frequent tokens that appear only in "A" extracts (pdftotext); TOP_10_UNIQUE_TOKEN_DIFFS_B records the top 10 most frequent tokens that appear only in "B" extracts (Tika/PDFBox); NUM_COMMON_TOKENS_DIFF_IN_B records whether there has been an increase (positive number) or a decrease in "common tokens" if one were to move from "A" to "B" as the extraction tool.
For example, for file commoncrawl3_refetched/7L/7L6BDSEYCY3QVVPM7YYK3FVCK2ZLOSA7, NUM_COMMON_TOKENS_DIFF_IN_B has a value of '38', which suggests that there are 38 more "common words" in the text extracted by Tika/PDFBox than by pdftotext.
TOP_10_UNIQUE_TOKEN_DIFFS_A has
bklasse: 2 | gehoben: 2 | jahressiegers: 2 | untersagt: 2 | verlasslichkeit: 2 | 3321: 1 | 50jahrigen: 1 | 5282708: 1 | 60jahriges: 1 | 970843: 1 |
TOP_10_UNIQUE_TOKEN_DIFFS_B has
e: 23 | gen: 9 | be: 6 | te: 6 | schaft: 5 | gung: 4 | nant: 4 | nen: 4 | o: 4 | ten: 4 |
This probably means that some words were incorrectly split by PDFBox ("tool B"), and it may mean that a hyphen was incorrectly dropped in a few words by pdftotext ("tool A"): "bklasse", "50jahrigen", and "60jahriges" should probably be "b-klasse", "50-jahrigen", and "60-jahriges".
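Both failure modes come from line-break hyphenation: fragments like "schaft" and "gung" suggest the halves were never rejoined, while "bklasse" suggests a join that dropped a genuine hyphen. A sketch of naive dehyphenation, which illustrates why the second error is hard to avoid without a wordlist:

```python
import re

# Naive line-break dehyphenation: join "Gesell-\nschaft" into
# "Gesellschaft". Applied blindly to "b-\nklasse" it produces
# "bklasse" -- exactly the pdftotext error noted above -- because
# distinguishing a soft (line-break) hyphen from a genuine one
# requires a dictionary or language model.
LINE_BREAK_HYPHEN = re.compile(r"(\w)-\n(\w)")

def dehyphenate(text: str) -> str:
    return LINE_BREAK_HYPHEN.sub(r"\1\2", text)

print(dehyphenate("Gesell-\nschaft"))  # -> Gesellschaft
```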