Jerome Charron
<<MailTo(jerome.charron AT PASDEPOURRIELS gmail DOT com)>>
Activities
Nutch contributions
- MimeTypeUtil package (org.apache.nutch.util.mime)
- LanguageIdentifierPlugin
- Some benchmarks: LanguageIdentifierBenchs
- Enhance the LanguageParseFilter by checking the validity of the parsed language string.
- TODO: Enhance the LanguageParseFilter by correlating all the available clues (DublinCore, Meta-Http-Equiv, Content-Language, and statistical content analysis) instead of taking only the first one available.
- TODO: Improve the API:
- return an ordered list of candidate languages instead of just one.
- See also Andrzej's comments:
- exporting a list of supported languages,
- exporting an NGramProfile of the analyzed text,
- allow processing of chunks of input.
- MultiLingualSupport proposal.
- ParserFactoryImprovementProposal
- TODO: Use content-type/extension-id mapping instead of content-type/plugin-id
- PluginRepository enhancements:
- Add the ability to handle plugin inter-dependencies (i.e., a plugin can declare a runtime dependency on one or more other plugins using the <requires><import plugin="plugin-id"/></requires> directive in the plugin.xml plugin descriptor).
- Add the ability to automatically load (depending on configuration) the required plugins declared through plugin dependencies, with a check for circular dependencies.
- MarkupLanguageParserProposal
- Microformats HtmlParseFilter:
- Nutch article on the French Wikipedia.
- URL Filters enhancements:
- Add a mini-framework plugin for regular-expression-based URL filters (lib-regex-filter)
- Add a regex URL filter implementation based on dk.brics.automaton (Finite-State Automata for Java).
- See RegexURLFiltersBenchs for a comparison of urlfilter-regex and urlfilter-automaton plugins
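The dependency handling listed under the PluginRepository enhancements (load required plugins first, reject circular dependencies) can be illustrated with a small sketch. This is not the Nutch implementation, which is written in Java; it is a language-neutral illustration in Python. The function name `resolve_load_order` and the dependency edges in the usage note are assumptions made for the example only.

```python
from typing import Dict, List


def resolve_load_order(requires: Dict[str, List[str]], roots: List[str]) -> List[str]:
    """Return plugin ids in load order, dependencies before dependents.

    `requires` maps a plugin id to the ids it imports (hypothetical stand-in
    for the <requires><import plugin="..."/></requires> entries).
    Raises ValueError on a circular dependency, KeyError on an unknown plugin.
    """
    order: List[str] = []
    state: Dict[str, str] = {}  # plugin id -> "visiting" or "done"

    def visit(plugin_id: str, path: List[str]) -> None:
        if state.get(plugin_id) == "done":
            return  # already scheduled for loading
        if state.get(plugin_id) == "visiting":
            # We came back to a plugin that is still being resolved: a cycle.
            cycle = " -> ".join(path + [plugin_id])
            raise ValueError(f"circular plugin dependency: {cycle}")
        if plugin_id not in requires:
            raise KeyError(f"unknown plugin: {plugin_id}")
        state[plugin_id] = "visiting"
        for dep in requires[plugin_id]:
            visit(dep, path + [plugin_id])
        state[plugin_id] = "done"
        order.append(plugin_id)  # appended only after all its dependencies

    for root in roots:
        visit(root, [])
    return order
```

For instance, assuming for illustration that both URL filter plugins import lib-regex-filter, `resolve_load_order({"lib-regex-filter": [], "urlfilter-regex": ["lib-regex-filter"], "urlfilter-automaton": ["lib-regex-filter"]}, ["urlfilter-regex", "urlfilter-automaton"])` schedules lib-regex-filter before either filter, and a graph where two plugins import each other raises `ValueError`.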
CategoryHomepage