Apache Tika's HTML Encoding Study

In support of TIKA-2038, we gathered a new subset of HTML pages from CC-MAIN-2017-04.

This page offers a first rough draft of the process. Some of the code is available on a personal GitHub site. This code relies heavily on Dominik Stadler's CommonCrawlDocumentDownload code, and the author of SimpleCommonCrawlExtractor is extremely grateful to Dominik.

  1. Determined which top-level domains (TLDs) were of interest
  2. Downloaded the 300 index files from Common Crawl via Groovy (217 GB of data):
    •         def cc = "CC-MAIN-2017-04"
              def url1 = "https://commoncrawl.s3.amazonaws.com/cc-index/collections/"
              def url2 = "/indexes/cdx-"
              // Index files are named cdx-00000.gz through cdx-00299.gz
              (0..299).each { i ->
                  def u = url1 + cc + url2 + "$i".padLeft(5, '0') + ".gz"
                  def p = "wget -q $u".execute()
                  // Block until wget exits, forwarding its output
                  p.waitForProcessOutput(System.out, System.err)
              }
  3. Counted the number of pages per TLD that had "text/html" in the HTTP Content-Type header
    • Map:
             java -cp cc-extractor-0.0.1.jar org.tallison.cc.index.CCIndexBatchReader \
                 10 /data1/commoncrawl_indices/CC-MAIN-2017-04/ CountMimesByTopLevelDomains
    • Reduce:
             java -cp cc-extractor-0.0.1.jar org.tallison.cc.index.reducers.DoubleKeyReducer \
                 mime_tld_counts mime_tld_total.txt
  4. Created sampling frequencies per TLD, with a target of 50k pages per TLD (100k for ".com"). This was done by loading mime_tld_total.txt into a database and running some GROUP BY queries. See tld_mimes.txt.

  5. Randomly sampled records from the 300 index files according to the per-TLD sampling frequencies
    • Map:
             java -cp cc-extractor-0.0.1.jar org.tallison.cc.index.CCIndexBatchReader \
                 10 /data1/commoncrawl_indices/CC-MAIN-2017-04/ DownSample \
                 tld_mimes.txt tld_mimes_down_sampled
    • Reduce:
             java -cp cc-extractor-0.0.1.jar org.tallison.cc.index.reducers.ConcatReducer \
                 tld_mimes_down_sampled tld_mimes_down_sampled_index
  6. Pulled the data from Common Crawl
    •        java -cp cc-extractor-0.0.1.jar org.tallison.cc.CCGetter \
                 tld_mimes_down_sampled_index /data4/docs/commoncrawl_html_study
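For readers who want a feel for the counting in step 3, here is a minimal sketch in Java. It is not the code in cc-extractor: the class name MimeTldCountSketch, its methods, and the sample records are hypothetical, and it assumes each index record has already been reduced to a (mime, url) pair. It also treats the last label of the host name as the TLD, which is wrong for hosts given as IP addresses.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a map/reduce style count of pages per (mime, TLD).
class MimeTldCountSketch {

    /** Returns the last label of the URL's host, lowercased, or "" if none can be found. */
    static String topLevelDomain(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) {
                return "";
            }
            int dot = host.lastIndexOf('.');
            return dot < 0 ? "" : host.substring(dot + 1).toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            // Malformed URLs are counted under an empty TLD rather than aborting the run.
            return "";
        }
    }

    /** "Map" side: increments the counter for one record, keyed by mime and TLD. */
    static void count(Map<String, Long> counts, String mime, String url) {
        counts.merge(mime + "\t" + topLevelDomain(url), 1L, Long::sum);
    }

    /** "Reduce" side: sums the partial counts produced for each index file. */
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, value) -> total.merge(key, value, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up records standing in for two index files.
        Map<String, Long> first = new TreeMap<>();
        count(first, "text/html", "http://www.example.com/index.html");
        count(first, "text/html", "http://www.example.de/");
        Map<String, Long> second = new TreeMap<>();
        count(second, "text/html", "https://shop.example.com/cart");
        count(second, "application/pdf", "http://www.example.de/a.pdf");

        reduce(List.of(first, second))
                .forEach((key, value) -> System.out.println(key + "\t" + value));
    }
}
```

Keeping the per-file counts separate and summing them afterwards mirrors the Map and reduce commands in step 3, and lets each index file be processed independently.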

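Steps 4 and 5 amount to turning per-TLD totals into keep probabilities and then drawing against them. The sketch below shows one plausible reading in Java, not the actual DownSample implementation: the class name TldSamplingSketch, its methods, the per-TLD totals, and the use of an independent random draw per record are all assumptions. Only the targets (50k per TLD, 100k for ".com") come from the text above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: per-TLD sampling rate plus an independent draw per record.
class TldSamplingSketch {

    // Targets stated in step 4.
    static final long DEFAULT_TARGET = 50_000L;
    static final long COM_TARGET = 100_000L;

    /** Probability of keeping a single index record for the given TLD. */
    static double samplingRate(String tld, long totalPages) {
        if (totalPages <= 0) {
            return 0.0;
        }
        long target = "com".equals(tld) ? COM_TARGET : DEFAULT_TARGET;
        // TLDs with fewer pages than the target are kept in full.
        return Math.min(1.0, (double) target / totalPages);
    }

    /** Keeps the record with probability rate. */
    static boolean keep(double rate, Random random) {
        return random.nextDouble() < rate;
    }

    public static void main(String[] args) {
        // Made-up totals, for illustration only.
        Map<String, Long> totals = new LinkedHashMap<>();
        totals.put("com", 2_000_000L);
        totals.put("de", 200_000L);
        totals.put("is", 20_000L);

        Random random = new Random(42L);
        for (Map.Entry<String, Long> entry : totals.entrySet()) {
            double rate = samplingRate(entry.getKey(), entry.getValue());
            long kept = 0;
            for (long i = 0; i < entry.getValue(); i++) {
                if (keep(rate, random)) {
                    kept++;
                }
            }
            System.out.printf("%s rate=%.3f kept=%d%n", entry.getKey(), rate, kept);
        }
    }
}
```

With independent draws the number of records kept per TLD lands near the target rather than exactly on it, which is usually acceptable for a study sample of this size.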
ApacheTikaHtmlEncodingStudy (last edited 2017-02-21 02:09:11 by TimothyAllison)