Refreshing the Regression Corpus

Since the last effort to refresh the regression corpus (TIKA-2038), Common Crawl has added important metadata items to its indices, including mime-detected, languages and charset. I opened TIKA-2750 to track progress on updating our corpus, and I describe the steps taken here.

There are two primary goals for TIKA-2750: include more "interesting" files, and refetch some of the files that are truncated in Common Crawl. I don't have a precise definition of "interesting," but the goal is broad coverage of file formats and languages. See the note on coverage metrics and jqf below.

While I recognize that the new metadata is automatically generated and may contain errors, it allows for more accurate oversampling of the file formats/charsets that are of interest.

I started by downloading the 300 index files for September 2018's crawl: CC-MAIN-2018-39 (~226GB).
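
Purely as an illustration of what that counting pass involves, and not the actual code in the repository linked below, each gzipped index line carries a SURT key, a timestamp and a JSON blob with fields such as mime-detected; tallying detected mimes could look something like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

public class MimeCounterSketch {
    // crude extraction of "mime-detected" from the JSON blob; a real pass would use a JSON parser
    private static final Pattern MIME_DETECTED =
            Pattern.compile("\"mime-detected\"\\s*:\\s*\"([^\"]+)\"");

    public static void main(String[] args) throws Exception {
        Map<String, Long> counts = new HashMap<>();
        for (String arg : args) {
            Path indexFile = Paths.get(arg); // e.g. one cdx-*.gz file from CC-MAIN-2018-39
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(Files.newInputStream(indexFile)), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // each line is: <SURT url key> <timestamp> <JSON record>
                    Matcher m = MIME_DETECTED.matcher(line);
                    if (m.find()) {
                        counts.merge(m.group(1), 1L, Long::sum);
                    }
                }
            }
        }
        counts.entrySet().stream()
                .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
                .limit(10)
                .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}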

The top 10 'detected mimes' are:

||mime||count||
||text/html||2,070,375,191||
||application/xhtml+xml||749,683,874||
||image/jpeg||6,207,029||
||application/pdf||4,128,740||
||application/rss+xml||3,495,173||
||application/atom+xml||2,868,625||
||application/xml||1,353,092||
||image/png||585,019||
||text/plain||492,429||
||text/calendar||470,624||

Given the work on TIKA-2038 and the focus on country top level domains (TLDs), I also counted the number of mimes by TLD and the number of charsets by TLD ([[https://issues.apache.org/jira/secure/attachment/12945796/CC-MAIN-2018-39-mimes-charsets-by-tld.zip|here]]).

Finally, I calculated the counts for pairs of 'mime' (as alleged by the http-header) and 'detected-mime'; that table is available [[https://issues.apache.org/jira/secure/attachment/12946585/CC-MAIN-2018-39-mimes-v-detected.zip|here]].

Step 1: Select and Retrieve the Files from Common Crawl

My sense from our JIRA and our user list is that people are primarily interested in office-ish files (PDF, MSOffice, RTF, eml, etc.) and/or HTML. I therefore chose to break the sampling into three passes:

  1. PDFs, MSOffice and other office-ish files
  2. Other binaries
  3. HTML/Text

I wanted to keep the corpus to below 1 TB and on the order of a few million files.

The sampling frame tables are available [[http://162.242.228.174/share/commoncrawl3/sampling_frames.zip|here]]; there's one sampling frame for each of the three file classes.

NOTE: I hesitate even to use the terms "sampling" and "sampling frame" because I do not mean to imply that I used much rigor. I manually calculated the sampling frames based on the total counts so that we'd have roughly the desired number of files and file types. As I describe below, there are some file types that I thought we should have more of (e.g. 'octet-stream').

The code for everything described here is available on [[https://github.com/tballison/SimpleCommonCrawlExtractor|github]].

Office formats

The top 10 file formats in this category are:

||mime||count||
||application/pdf||4,128,740||
||application/vnd.openxmlformats-officedocument.wordprocessingml.document||53,579||
||application/msword||52,087||
||application/rtf||22,509||
||application/vnd.ms-excel||22,067||
||application/vnd.openxmlformats-officedocument.spreadsheetml.sheet||16,290||
||application/vnd.oasis.opendocument.text||8,314||
||application/vnd.openxmlformats-officedocument.presentationml.presentation||6,835||
||application/vnd.ms-powerpoint||5,799||
||application/vnd.openxmlformats-officedocument.presentationml.slideshow||2,465||

select mime, sum(count) cnt
from detected_mimes
where 
(mime ilike '%pdf%' 
 OR 
 mime similar to '%(word|doc|power|ppt|excel|xls|application.*access|outlook|msg|visio|rtf|iwork|pages|numbers|keynot)%'
)
group by mime
order by cnt desc

Given how quickly the tail drops off, we could afford to take all of the non-PDFs. For PDFs, we created a sampling frame by TLD.

We used org.tallison.cc.index.mappers.DownSample to select files for downloading from Common Crawl.
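
Purely as a sketch of the idea, and not the actual DownSample implementation, per-mime selection driven by the manually built sampling frame could look something like the code below; a TLD-keyed frame for the PDFs would be analogous.

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration of per-mime downsampling; NOT the actual
// org.tallison.cc.index.mappers.DownSample implementation.
public class DownSampleSketch {
    private final Map<String, Double> samplingProbabilities = new HashMap<>();
    private final Random random = new Random();

    public DownSampleSketch(Map<String, Double> frame) {
        // frame maps a detected mime (e.g. "application/pdf") to the
        // probability of keeping an index record with that mime
        samplingProbabilities.putAll(frame);
    }

    /** @return true if the index record with this detected mime should be fetched */
    public boolean select(String detectedMime) {
        // mimes not listed in the sampling frame are kept unconditionally
        double p = samplingProbabilities.getOrDefault(detectedMime, 1.0);
        return random.nextDouble() < p;
    }
}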

Other Binaries

These are the top 10 other binaries:

||mime||cnt||
||image/jpeg||6,207,029||
||application/rss+xml||3,495,173||
||application/atom+xml||2,868,625||
||application/xml||1,353,092||
||image/png||585,019||
||application/octet-stream||330,029||
||application/json||237,232||
||application/rdf+xml||229,766||
||image/gif||166,851||
||application/gzip||151,940||

select mime, sum(count) cnt
from detected_mimes
where 
(mime not ilike '%pdf%' 
 and
 mime not similar to '%(word|doc|power|ppt|excel|xls|application.*access|outlook|msg|visio|rtf|iwork|pages|numbers|keynot)%'
 and mime not ilike '%html%'
 and mime not ilike '%text%'
)
group by mime
order by cnt desc

I created the sampling ratios for these by preferring non-xml but likely text-containing file types. Further, I wanted to include a fairly large portion of octet-stream so that we might be able to see how we can improve Tika's file detection.

We used org.tallison.cc.index.mappers.DownSample to select files for downloading from Common Crawl.

HTML/Text

For the HTML/text files, I wanted to oversample files that were not ASCII/UTF-8 English, and I wanted to oversample files that had no charset detected.

We used org.tallison.cc.index.mappers.DownSampleLangCharset to select the files for downloading from Common Crawl.
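
Again only as a sketch, and not the actual DownSampleLangCharset implementation (the probabilities below are invented for illustration), selection here would key on the languages/charset fields of the index record rather than the mime:

import java.util.Locale;
import java.util.Random;

// Hypothetical illustration only; NOT the actual
// org.tallison.cc.index.mappers.DownSampleLangCharset implementation.
// The probabilities below are invented for the sake of the example.
public class DownSampleLangCharsetSketch {
    private final Random random = new Random();

    public boolean select(String language, String charset) {
        // always keep records where no charset was detected
        if (charset == null || charset.isEmpty()) {
            return true;
        }
        String lang = language == null ? "" : language.toLowerCase(Locale.ROOT);
        String cs = charset.toLowerCase(Locale.ROOT);
        if ("eng".equals(lang) && ("utf-8".equals(cs) || "us-ascii".equals(cs))) {
            // heavily downsample ASCII/UTF-8 English
            return random.nextDouble() < 0.01;
        }
        // keep a much larger share of everything else
        return random.nextDouble() < 0.5;
    }
}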

The Output

In addition to storing the files, I generated a table for each pull that records information from the WARC record, including the http-headers as archived in Common Crawl. The three table files are available here (116MB!).

Step 2: Refetch Likely Truncated Files

Common Crawl truncates files at 1MB. We've found it useful to have truncated files in our corpus, but truncation disproportionately affects some file formats, such as PDF and MSAccess files, and we wanted to have some recent, largish files in the corpus. We selected the files that were close to 1MB or were marked as truncated:

select url, cc_digest from crawled_files
where
(cc_mime_detected ilike '%tika%'
 or cc_mime_detected ilike '%power%'
 or cc_mime_detected ilike '%access%'
 or cc_mime_detected ilike '%rtf%'
 or cc_mime_detected ilike '%pdf%'
 or cc_mime_detected ilike '%sqlite%'
 or cc_mime_detected ilike '%openxml%'
 or cc_mime_detected ilike '%word%'
 or cc_mime_detected ilike '%rfc822%'
 or cc_mime_detected ilike '%apple%'
 or cc_mime_detected ilike '%excel%'
 or cc_mime_detected ilike '%sheet%'
 or cc_mime_detected ilike '%onenote%'
 or cc_mime_detected ilike '%outlook%')
and (actual_length > 990000 or warc_is_truncated='TRUE')
order by random()

A rollup, by mime type, of the files that were to be refetched:

||mime||count||
||application/pdf||121,386||
||application/vnd.openxmlformats-officedocument.presentationml.presentation||3,929||
||application/x-tika-msoffice||3,830||
||application/vnd.ms-powerpoint||2,942||
||application/msword||2,783||
||application/vnd.openxmlformats-officedocument.wordprocessingml.document||2,722||
||application/x-tika-ooxml||2,612||
||application/vnd.openxmlformats-officedocument.presentationml.slideshow||1,663||
||application/rtf||1,569||
||application/vnd.ms-excel||1,186||

The full table is here.

We used org.tallison.cc.WReGetter, a wrapper around 'wget', to re-fetch the files from their original URLs. If a refetched file was > 50MB, we deleted it, and if a refetch took longer than 2 minutes, we killed the process.
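
A minimal sketch of that kind of wrapper (hypothetical code, not the actual WReGetter class):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the refetch step; NOT the actual org.tallison.cc.WReGetter.
public class RefetchSketch {
    private static final long MAX_BYTES = 50L * 1024 * 1024; // delete refetched files > 50MB
    private static final long TIMEOUT_MINUTES = 2;           // kill wget after 2 minutes

    /** @return true if the file was refetched and kept */
    public static boolean refetch(String url, Path targetFile) throws Exception {
        Process wget = new ProcessBuilder("wget", "-q", "-O", targetFile.toString(), url)
                .inheritIO()
                .start();
        if (!wget.waitFor(TIMEOUT_MINUTES, TimeUnit.MINUTES)) {
            wget.destroyForcibly();           // refetch took too long
            Files.deleteIfExists(targetFile);
            return false;
        }
        if (wget.exitValue() != 0 || !Files.exists(targetFile)
                || Files.size(targetFile) > MAX_BYTES) {
            Files.deleteIfExists(targetFile); // failed fetch or file too big
            return false;
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        // e.g. refetch a single URL into a target file
        refetch(args[0], Paths.get(args[1]));
    }
}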

We refetched these files to a new directory and stored them by their new digest. Each thread in WReGetter wrote to a table to record the mapping of the original digest to the new digest and whether the new file was successfully refetched and/or was too big.

We then randomly deleted 80% of the original truncated files and moved the other 20% to /commoncrawl3_truncated.
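
A minimal sketch of that split (hypothetical code; only the /commoncrawl3_truncated path comes from the text above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the 80/20 split of the original truncated files;
// not the actual commands/scripts used.
public class TruncatedSplitSketch {
    public static void main(String[] args) throws IOException {
        Path truncatedRoot = Paths.get(args[0]);              // directory of original truncated files
        Path keepDir = Paths.get("/commoncrawl3_truncated");  // target named in the text above
        Random random = new Random();
        List<Path> originals;
        try (Stream<Path> stream = Files.walk(truncatedRoot)) {
            originals = stream.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path p : originals) {
            if (random.nextDouble() < 0.8) {
                Files.delete(p);                                  // drop ~80%
            } else {
                Files.move(p, keepDir.resolve(p.getFileName()));  // keep ~20%
            }
        }
    }
}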

Finally, we moved the refetched files into the main /commoncrawl3 directory.

Step 3 -- TODO

We carried out this work on one of our TB drives. We have to figure out what to keep from our older commoncrawl2 collection and then merge the two collections.

Coverage Metrics

Tobias Ospelt and Rohan Padhye (the author of [[https://github.com/rohanpadhye/jqf|jqf]]) both noted on our dev list that we could use coverage analysis to identify a minimal corpus that would cover as much of our code base as possible. Obviously, a minimal corpus designed for our current codebase would not be guaranteed to cover new features, and we'd want to leave plenty of extra files around in the hope that some of them would capture new code paths.

Nevertheless, if we could use jqf or another tool to reduce the corpus, that would help make our runs more efficient.

On TIKA-2750, Tobias reported that his experiment with afl-cmin.py showed that it would take roughly four months on our single VM just to create traces (~300 files per hour).
