= Refreshing the Regression Corpus =

Since the last effort to refresh the regression corpus (TIKA-2038), CommonCrawl has added important metadata items to its indices, including mime-detected, languages, and charset. I opened TIKA-2750 to track progress on updating our corpus.

There are two primary goals for TIKA-2750: include more "interesting" files, and refetch some of the files that are truncated in CommonCrawl. I don't have a precise definition of "interesting," but the goal is to include broad coverage of file formats and languages.

While I recognize that the new metadata may contain errors, it allows for more accurate oversampling of the file formats and charsets that are of interest.

I started by downloading the 300 index files for September 2018's crawl: CC-MAIN-2018-39 (~226 GB).

Given the work on TIKA-2038 and the focus on country top-level domains (TLDs), I started by counting the number of mimes by TLD and the number of charsets by TLD (here).
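
As a minimal sketch of those tallies, assuming the index records were loaded into a table with one row per capture and columns for the TLD, detected mime, and charset (the ''cc_index'' name and its columns are assumptions, not the actual setup):

{{{
-- per-TLD tallies of detected mimes and of charsets;
-- cc_index and its columns (tld, mime, charset) are assumed names, not the actual schema
select tld, mime, count(*) as cnt
from cc_index
group by tld, mime
order by tld, cnt desc;

select tld, charset, count(*) as cnt
from cc_index
group by tld, charset
order by tld, cnt desc;
}}}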

I also counted the number of "detected mimes," and the top 10 are:

||mime||count||
||text/html||2,070,375,191||
||application/xhtml+xml||749,683,874||
||image/jpeg||6,207,029||
||application/pdf||4,128,740||
||application/rss+xml||3,495,173||
||application/atom+xml||2,868,625||
||application/xml||1,353,092||
||image/png||585,019||
||text/plain||492,429||
||text/calendar||470,624||
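
For reference, a minimal sketch of the aggregation behind this table, assuming the same ''detected_mimes'' table (mime plus a pre-computed count column) that the queries further below run against:

{{{
-- overall detected-mime counts, largest first
select mime, sum(count) cnt
from detected_mimes
group by mime
order by cnt desc
limit 10;
}}}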

= Step 1: Select and Retrieve the Files from Common Crawl =

Given our interest in office-ish files, I chose to break the sampling into three passes:

  1. MSOffice and PDFs
  2. Other binaries
  3. HTML/Text

I wanted to keep the corpus to below 1 TB and on the order of a few million files.

== MSOffice and PDFs ==

The top 10 file formats of this category included:

||mime||count||
||application/pdf||4,128,740||
||application/vnd.openxmlformats-officedocument.wordprocessingml.document||53,579||
||application/msword||52,087||
||application/rtf||22,509||
||application/vnd.ms-excel||22,067||
||application/vnd.openxmlformats-officedocument.spreadsheetml.sheet||16,290||
||application/vnd.oasis.opendocument.text||8,314||
||application/vnd.openxmlformats-officedocument.presentationml.presentation||6,835||
||application/vnd.ms-powerpoint||5,799||
||application/vnd.openxmlformats-officedocument.presentationml.slideshow||2,465||

{{{
select mime, sum(count) cnt
from detected_mimes
where
(mime ilike '%pdf%'
 OR
 mime similar to '%(word|doc|power|ppt|excel|xls|application.*access|outlook|msg|visio|rtf|iwork|pages|numbers|keynot)%'
)
group by mime
order by cnt desc
}}}

Given how quickly the tail drops off, we could afford to take all of the non-PDFs. For PDFs, we created a sampling frame by TLD.
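
A minimal sketch of the per-TLD counts underlying such a frame, assuming a by-TLD variant of the counts table (the ''detected_mimes_by_tld'' name and its tld column are hypothetical):

{{{
-- per-TLD PDF counts from which per-TLD sampling rates can be drawn;
-- detected_mimes_by_tld is a hypothetical table name
select tld, sum(count) cnt
from detected_mimes_by_tld
where mime ilike '%pdf%'
group by tld
order by cnt desc;
}}}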

== Other Binaries ==

These are the top 10 other binaries:

||mime||cnt||
||image/jpeg||6,207,029||
||application/rss+xml||3,495,173||
||application/atom+xml||2,868,625||
||application/xml||1,353,092||
||image/png||585,019||
||application/octet-stream||330,029||
||application/json||237,232||
||application/rdf+xml||229,766||
||image/gif||166,851||
||application/gzip||151,940||

{{{
select mime, sum(count) cnt
from detected_mimes
where
(mime not ilike '%pdf%'
 and
 mime not similar to '%(word|doc|power|ppt|excel|xls|application.*access|outlook|msg|visio|rtf|iwork|pages|numbers|keynot)%'
 and mime not ilike '%html%'
 and mime not ilike '%text%'
)
group by mime
order by cnt desc
}}}

I created the sampling ratios for these by preferring non-XML but likely text-containing file types. Further, I wanted to include a fairly large portion of ''octet-stream'' so that we might be able to see how we can improve Tika's file detection.
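
The ratios themselves were set by hand; as one way to ground them, the sketch below computes each mime's share of this "other binaries" group, reusing the filter from the query above:

{{{
-- each mime's share of the "other binaries" group,
-- as a starting point for hand-set sampling ratios
with other_binaries as (
  select mime, sum(count) cnt
  from detected_mimes
  where
  (mime not ilike '%pdf%'
   and
   mime not similar to '%(word|doc|power|ppt|excel|xls|application.*access|outlook|msg|visio|rtf|iwork|pages|numbers|keynot)%'
   and mime not ilike '%html%'
   and mime not ilike '%text%'
  )
  group by mime
)
select mime, cnt,
       round(cnt * 100.0 / sum(cnt) over (), 2) as pct_of_group
from other_binaries
order by cnt desc;
}}}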

== HTML/Text ==

For the HTML/text files, I wanted to oversample files that were not ASCII/UTF-8 English.
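
A minimal sketch of how such records might be pulled for oversampling, again assuming the raw index rows sit in a ''cc_index'' table with url, mime, charset, and languages columns (all of these names are assumptions):

{{{
-- HTML/text records to oversample: anything not plain ASCII/UTF-8 English;
-- cc_index and its columns (url, mime, charset, languages) are assumed names
select url, mime, charset, languages
from cc_index
where (mime ilike '%html%' or mime ilike '%text%')
  and not (charset in ('UTF-8', 'US-ASCII') and languages = 'eng');
}}}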

= Step 2: Refetch Likely Truncated Files =
