CheTA

Principal Investigator: Dr Peter Murray-Rust

Chemistry Using Text Annotations

This project (CheTA) will integrate Cambridge's chemical text mining tool OSCAR with the U-Compare workflow infrastructure developed by NaCTeM and others. This integration adds chemistry to the world's largest public collection of interoperable text mining tools and will be highly valued by influential stakeholders both in the JISC community and in the wider chemistry community. Once a baseline study (UCC and RSC) and the integration are complete, the project will use the CheTA tools to index a corpus of documents of different types and provenance. CheTA will develop a rigorous evaluation framework comprising annotation studies for a formal scientific evaluation of the system ('Are we extracting metadata correctly?' - RSC/NaCTeM), user requirements studies of the metadata needs of 'real world users' ('What metadata is useful?' - RSC/UCC), and a comparison of the extracted metadata against those usefulness findings (all project partners). Furthermore, the performance of the CheTA system will be compared with that of the Thomson Reuters OpenCalais service enhanced with a chemistry lexicon. Finally, the economic cost of metadata generation by both human indexers and robots will be quantified.

It is expected that applying the professionally maintained, automated and sustainable text mining services enabled by CheTA to public information sources such as PubMed will lead to significant future enhancements in resource discovery.

The project will create a high-quality infrastructure for running chemical text mining, which will be released as open source.

 http://www.jisc.ac.uk/whatwedo/programmes/inf11/resdis/cheta