Refreshing the Regression Corpus

Since the last effort to refresh the regression corpus (TIKA-2038), CommonCrawl has added important metadata items to its indices, including mime-detected, languages, and charset. I opened TIKA-2750 to track progress on updating our corpus.

There are two primary goals for TIKA-2750: include more "interesting" files, and refetch some of the files that are truncated in CommonCrawl. I don't have a formal definition of "interesting," but the aim is broad coverage of file formats and languages.

While I recognize that the new metadata may contain errors, it allows for more accurate oversampling of the file formats and charsets that are of interest.

I started by downloading the 300 index files for the September 2018 crawl, CC-MAIN-2018-39 (~226 GB).

Given the work on TIKA-2038 and the focus on country top level domains (TLDs), I started by counting the number of mimes by TLD and the number of charsets by TLD (here).
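The counting step can be sketched roughly as follows. This is a minimal illustration, not the code used for the numbers on this page. It assumes each index line ends with a JSON object carrying "url", "mime-detected" and "charset" keys, and it takes the last dot-separated label of the hostname as the TLD; the function names tld_of and count_by_tld are hypothetical.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def tld_of(url):
    """Return the last dot-separated label of the URL's hostname, or ""."""
    host = urlsplit(url).hostname or ""
    # Note: bare IP addresses are not filtered out in this sketch.
    return host.rsplit(".", 1)[-1] if "." in host else ""


def count_by_tld(lines, field="mime-detected"):
    """Count (tld, value) pairs for one metadata field across index lines.

    Assumes the JSON payload starts at the first "{" on each line.
    Lines without a parseable payload are skipped.
    """
    counts = Counter()
    for line in lines:
        brace = line.find("{")
        if brace < 0:
            continue
        try:
            record = json.loads(line[brace:])
        except json.JSONDecodeError:
            continue
        tld = tld_of(record.get("url", ""))
        value = record.get(field)
        if tld and value:
            counts[(tld, value)] += 1
    return counts
```

Calling count_by_tld(lines, field="charset") over the same lines would give the charset-by-TLD counts.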

I also counted the number of "detected mimes," and the top 10 are:

| mime | count |
|---|---|
| text/html | 2,070,375,191 |
| application/xhtml+xml | 749,683,874 |
| image/jpeg | 6,207,029 |
| application/pdf | 4,128,740 |
| application/rss+xml | 3,495,173 |
| application/atom+xml | 2,868,625 |
| application/xml | 1,353,092 |
| image/png | 585,019 |
| text/plain | 492,429 |
| text/calendar | 470,624 |

Given our interest in office-ish files, I chose to break the sampling into three passes:

  1. MSOffice and PDFs
  2. Other binaries
  3. HTML/Text

I wanted to keep the corpus to below 1 TB and on the order of a few million files.

MSOffice and PDFs

The top 10 file formats of this category included:

| mime | count |
|---|---|
| application/pdf | 4,128,740 |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document | 53,579 |
| application/msword | 52,087 |
| application/rtf | 22,509 |
| application/vnd.ms-excel | 22,067 |
| application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | 16,290 |
| application/vnd.oasis.opendocument.text | 8,314 |
| application/vnd.openxmlformats-officedocument.presentationml.presentation | 6,835 |
| application/vnd.ms-powerpoint | 5,799 |
| application/vnd.openxmlformats-officedocument.presentationml.slideshow | 2,465 |

select mime, sum(count) cnt
from detected_mimes
where 
(mime ilike '%pdf%' 
 OR 
 mime similar to '%(word|doc|power|ppt|excel|xls|application.*access|outlook|msg|visio|rtf|iwork|pages|numbers|keynot)%'
)
group by mime
order by cnt desc
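
For anyone filtering index records outside the database, the where clause of the query above can be approximated in Python. This is only a sketch: it assumes the `application.*access` alternative was meant as "application, then anything, then access" and so treats the whole list as an ordinary regular expression. The names OFFICE_PATTERN and is_office_or_pdf are hypothetical.

```python
import re

# Keyword list copied from the query; matched case-sensitively, like "similar to".
OFFICE_PATTERN = re.compile(
    r"word|doc|power|ppt|excel|xls|application.*access|outlook|msg|visio"
    r"|rtf|iwork|pages|numbers|keynot"
)


def is_office_or_pdf(mime):
    """True if the mime would be selected for the MSOffice and PDFs pass."""
    # "ilike '%pdf%'" is a case-insensitive substring test.
    if "pdf" in mime.lower():
        return True
    # "similar to '%(...)%'" is treated here as "contains any alternative".
    return OFFICE_PATTERN.search(mime) is not None
```

The keywords are deliberately broad (for example, "doc" also matches the OpenDocument types), so the selection favors recall over precision.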

Given how quickly the tail drops off, we could afford to take all of the non-PDFs. For PDFs, we created a sampling frame by TLD.
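
One simple way to build such a sampling frame is to cap the number of PDFs taken per TLD, so that small TLDs are kept whole and large ones are downsampled. The sketch below illustrates that idea only; the cap-based scheme, the function names and the example counts are assumptions, not the method or figures actually used for TIKA-2750.

```python
import random


def sampling_rates(counts_by_tld, per_tld_cap):
    """Map each TLD to the fraction of its files to keep.

    TLDs with at most per_tld_cap files get rate 1.0 (keep everything);
    larger TLDs get per_tld_cap / count, so each contributes roughly
    per_tld_cap files in expectation.
    """
    rates = {}
    for tld, count in counts_by_tld.items():
        if count <= 0:
            continue
        rates[tld] = min(1.0, per_tld_cap / count)
    return rates


def keep(tld, rates, rng):
    """Decide whether to keep one file from the given TLD."""
    return rng.random() < rates.get(tld, 0.0)


# Hypothetical counts, for illustration only.
example_rates = sampling_rates({"com": 1000, "is": 40}, per_tld_cap=100)
rng = random.Random(0)  # seeded so that a selection run is repeatable
selected = keep("is", example_rates, rng)
```

Seeding the random generator makes it possible to regenerate the same selection later.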

Other Binaries

HTML/Text