These sections demonstrate how to train a Snowball relationship extraction system to extract Curie temperature relationships from sentences within a scientific article.
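The general training process works as follows: define a model for the property, define parse expressions that tag its entities, combine them into a ChemicalRelationship, and then train a Snowball object interactively on a corpus of documents.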
The Curie temperature of a magnetic material is the temperature at which the material changes between being paramagnetic and ferromagnetic. As such, a Curie temperature relationship consists of 4 entities: a chemical name, a specifier, a value and units.
First, define a standard ChemDataExtractor model for the Curie temperature:
from chemdataextractor.relex import Snowball, ChemicalRelationship
from chemdataextractor.model import BaseModel, StringType, ListType, ModelType, Compound
from chemdataextractor.parse import R, I, W, Optional, merge, join, OneOrMore, Any, ZeroOrMore, Start
from chemdataextractor.parse.cem import chemical_name, chemical_label
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.parse.common import lrb, rrb, delim
from chemdataextractor.utils import first
from chemdataextractor.doc import Paragraph, Heading, Sentence
from lxml import etree
import re

class CurieTemperature(BaseModel):
    specifier = StringType()
    value = StringType()
    units = StringType()

Compound.curie_temperatures = ListType(ModelType(CurieTemperature))
Now define expressions for identifying the entities:
# Define a very basic entity tagger
specifier = (I('curie') + I('temperature') + Optional(lrb | delim) + Optional(R(r'^T(C|c)(urie)?')) + Optional(rrb) | R(r'^T(C|c)(urie)?'))('specifier').add_action(join)
units = (R(r'^[CFK]\.?$'))('units').add_action(merge)
# Digits with an optional decimal part, written with either '.' or ','
value = (R(r'^\d+([\.,]\d+)?$'))('value')
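To get a feel for what these token-level patterns accept, the following sketch applies similar regular expressions with Python's standard `re` module. It is only an approximation for illustration: `tag_token` is a hypothetical helper, not part of ChemDataExtractor, the value pattern used here accepts an optional decimal part written with '.' or ',', and the multi-token "curie temperature" branch of the specifier is left out.

```python
import re

# Patterns for this sketch only; each is applied to a single token.
SPECIFIER_TOKEN = re.compile(r'^T(C|c)(urie)?')
UNITS_TOKEN = re.compile(r'^[CFK]\.?$')
VALUE_TOKEN = re.compile(r'^\d+([\.,]\d+)?$')


def tag_token(token):
    """Hypothetical helper: return a label for one token, or None."""
    if VALUE_TOKEN.match(token):
        return 'value'
    if UNITS_TOKEN.match(token):
        return 'units'
    if SPECIFIER_TOKEN.match(token):
        return 'specifier'
    return None


tokens = ['TC', 'of', '1043', 'K']
labels = [tag_token(t) for t in tokens]
print(labels)
```

Note that the specifier pattern has no end anchor, so any token that merely starts with `TC` or `Tc` is tagged as a specifier; whether that is too permissive depends on your corpus.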
Note that we tag each expression with a unique identifier that will be used later. Now let the entities in a sentence appear in any order (or whatever ordering you prefer). Here we specify that the value and units must occur together, but this does not have to be the case. We also define an extremely general parse phrase, which will be used to identify candidate sentences.
# Let the entities be any combination of chemical names, specifier values and units
entities = (chemical_name | specifier | value + units)

# Now create a very generic parse phrase that will match any combination of these entities
curie_temperature_phrase = (entities + OneOrMore(entities | Any()))('curie_temperature')

# List all the entities
curie_temp_entities = [chemical_name, specifier, value, units]
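The requirement that a value and its units occur together can be pictured with a plain-Python analogue. This is a rough sketch, not ChemDataExtractor's parser: `value_unit_pairs` is a hypothetical helper, the patterns are the ones assumed for this sketch, and the sentence is a made-up, pre-tokenized example.

```python
import re

# Patterns assumed for this sketch; each is applied to a single token.
VALUE = re.compile(r'^\d+([\.,]\d+)?$')
UNITS = re.compile(r'^[CFK]\.?$')


def value_unit_pairs(tokens):
    """Return (value, units) pairs where a units token directly follows a value token."""
    pairs = []
    for current, following in zip(tokens, tokens[1:]):
        if VALUE.match(current) and UNITS.match(following):
            pairs.append((current, following))
    return pairs


# Made-up example sentence, already split into tokens on whitespace.
sentence = 'Iron has a Curie temperature of 1043 K , measured on 12 samples .'
pairs = value_unit_pairs(sentence.split())
print(pairs)
```

A bare number such as `12` is ignored here because no units token follows it, which is the effect of writing `value + units` rather than listing `value` and `units` as separate alternatives.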
We are now ready to start Snowballing. Let's formalise our ChemicalRelationship, passing in the entities, the extraction phrase and a name.
curie_temp_relationship = ChemicalRelationship(curie_temp_entities, curie_temperature_phrase, name='curie_temperatures')
Create a Snowball object for our relationship and point it to a directory of training documents. Here we use the default parameters and begin training:
snowball = Snowball(curie_temp_relationship)
snowball.train(corpus='../tests/data/relex/curie_training_set/')
The training process is online. This means that the user can train the system on as many papers as they like, and it will continue to update the knowledge base. For each paper, the sentences are scanned for any matches to the parse phrase, and if a sentence matches, candidate relationships are formed. There can be many candidate relationships in a single sentence, so the output provides the user with all available candidates.
The user accepts a relationship by typing in the number (or numbers) of the candidates they wish to accept. For example, if you want candidate 0 only, type '0' and press enter. If you want 0 and 3, type '0,3' and press enter. If you don't want any, type anything else, e.g. 'n' or 'no'.
This training process automatically clusters the sentences you accept and updates the knowledge base. You can check what has been learned by searching in the relex/data folder. You can always stop training and start again, or come back to the same training process if you wish; simply load an existing Snowball system using: Snowball.load()
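The accept/reject step can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: it assumes candidates are every combination of the entities found in a sentence, and both `candidate_relationships` and `parse_selection` are hypothetical helpers. The entity lists stand in for what the taggers would find in a sentence mentioning two compounds.

```python
from itertools import product


def candidate_relationships(names, specifiers, values_with_units):
    """Form every (name, specifier, value, units) combination (an assumption
    about how candidates arise, made for this sketch)."""
    return [
        (name, spec, value, units)
        for name, spec, (value, units) in product(names, specifiers, values_with_units)
    ]


def parse_selection(reply, n_candidates):
    """Turn a reply such as '0' or '0,3' into a list of accepted indices.

    Any reply that is not a comma-separated list of valid indices
    (for example 'n' or 'no') accepts nothing.
    """
    try:
        chosen = [int(part) for part in reply.split(',')]
    except ValueError:
        return []
    if any(i < 0 or i >= n_candidates for i in chosen):
        return []
    return chosen


# Entities assumed to have been tagged in one sentence.
candidates = candidate_relationships(
    names=['Fe', 'Ni'],
    specifiers=['TC'],
    values_with_units=[('1043', 'K'), ('627', 'K')],
)
for index, candidate in enumerate(candidates):
    print(index, candidate)

# The user keeps candidates 0 and 3 and rejects the mismatched pairings.
accepted = [candidates[i] for i in parse_selection('0,3', len(candidates))]
print(accepted)
```

Two compounds and two values already give four candidates, which is why the trainer shows every candidate and lets you pick only the pairings the sentence actually states.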
Looking into data/relex/curie_temperatures_patterns.txt, we see what patterns were learned from our training: