Training

These sections demonstrate how to train a Snowball relationship extraction system to extract Curie temperature relationships from sentences within a scientific article.

The general training process works as follows:

Curie Temperature Relationships

The Curie temperature of a magnetic material is the temperature at which the material changes from being paramagnetic to ferromagnetic. As such, a Curie temperature relationship consists of four entities: a chemical name, a specifier, a value and units.

Defining the Entities

First, define a standard ChemDataExtractor model for the Curie temperature:


from chemdataextractor.relex import Snowball, ChemicalRelationship
from chemdataextractor.model import BaseModel, StringType, ListType, ModelType, Compound
from chemdataextractor.parse import R, I, W, Optional, merge, join, OneOrMore, Any, ZeroOrMore, Start
from chemdataextractor.parse.cem import chemical_name, chemical_label
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.parse.common import lrb, rrb, delim
from chemdataextractor.utils import first
from chemdataextractor.doc import Paragraph, Heading, Sentence
from lxml import etree
import re

class CurieTemperature(BaseModel):
    specifier = StringType()
    value = StringType()
    units = StringType()

Compound.curie_temperatures = ListType(ModelType(CurieTemperature))

	

Now define expressions for identifying the entities:


# Define a very basic entity tagger (raw strings keep the regex escapes intact)
specifier = (I('curie') + I('temperature') + Optional(lrb | delim) + Optional(R(r'^T(C|c)(urie)?')) + Optional(rrb) | R(r'^T(C|c)(urie)?'))('specifier').add_action(join)
units = (R(r'^[CFK]\.?$'))('units').add_action(merge)
# Accept either '.' or ',' as the decimal separator
value = (R(r'^\d+([.,]\d+)?$'))('value')
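To see what these token-level patterns accept, the regular expressions can be tried on individual tokens with Python's standard `re` module. This is only a rough sketch: it assumes that `R` tests a pattern against the text of a single token, and `matches` is a hypothetical helper, not part of ChemDataExtractor. The value pattern here writes the decimal separator as a character class, so that either a point or a comma is accepted.

```python
import re

# Patterns for the three taggers, applied to single tokens.
SPECIFIER = re.compile(r'^T(C|c)(urie)?')
UNITS = re.compile(r'^[CFK]\.?$')
VALUE = re.compile(r'^\d+([.,]\d+)?$')  # '.' or ',' as decimal separator


def matches(pattern, token):
    """Hypothetical single-token check: True if the pattern is found."""
    return pattern.search(token) is not None


named_patterns = [('specifier', SPECIFIER), ('units', UNITS), ('value', VALUE)]

for token in ['1043', '1043.5', '1043,5', 'K', 'K.', 'TC', 'Tc', 'kelvin']:
    labels = [name for name, pattern in named_patterns if matches(pattern, token)]
    print(token, labels)
```

Note that the units pattern is case-sensitive and only covers the single-letter symbols C, F and K, so a spelled-out unit such as 'kelvin' is not tagged.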

Note that each expression is tagged with a unique identifier that will be used later. Now let the entities in a sentence be any ordering of these (or whatever ordering you prefer). Here we specify that the value and units must occur together, but this does not have to be the case. We also define an extremely general parse phrase, which will be used to identify candidate sentences.


# Let the entities be any combination of chemical names, specifiers, and values followed by units
entities = (chemical_name | specifier | value + units)

# Now create a very generic parse phrase that will match any combination of these entities
curie_temperature_phrase = (entities + OneOrMore(entities | Any()))('curie_temperature')

# List all the entities
curie_temp_entities = [chemical_name, specifier, value, units]

We are now ready to start Snowballing. Let's formalise our ChemicalRelationship, passing in the entities, the extraction phrase and a name.


curie_temp_relationship = ChemicalRelationship(curie_temp_entities, curie_temperature_phrase, name='curie_temperatures')

Training the system

Create a Snowball object to use on our relationship and point to a path for training. Here we will use the default parameters:

Note that increasing TC and Tsim yields more extraction patterns but imposes stricter rules on new relations.
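The effect of the similarity threshold on the number of patterns can be illustrated with a toy single-pass clustering. This is a sketch under stated assumptions, not the ChemDataExtractor implementation: Jaccard overlap of token sets stands in for Snowball's similarity measure, each cluster is represented by its first sentence, and `jaccard` and `single_pass_cluster` are hypothetical names. It covers only Tsim; TC is not modelled.

```python
def jaccard(a, b):
    """Similarity between two token sets, in the range [0, 1]."""
    return len(a & b) / len(a | b)


def single_pass_cluster(sentences, tsim):
    """Assign each sentence to the first cluster whose seed it resembles."""
    clusters = []  # each cluster is a list of token sets; item 0 is the seed
    for tokens in sentences:
        for cluster in clusters:
            if jaccard(tokens, cluster[0]) >= tsim:
                cluster.append(tokens)
                break
        else:
            clusters.append([tokens])  # no existing cluster is similar enough
    return clusters


# X and V are placeholders for a tagged compound name and a tagged value.
sentences = [
    set('the curie temperature of X is V'.split()),
    set('the curie temperature of X was V'.split()),
    set('X has a curie temperature of V'.split()),
]

for tsim in (0.5, 0.7, 0.9):
    print(tsim, len(single_pass_cluster(sentences, tsim)))
```

With these three sentences the number of clusters grows from one to three as the threshold rises, which is the sense in which a higher Tsim yields more extraction patterns, each matching a narrower set of sentences.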

Now create a Snowball object and begin training:


snowball = Snowball(curie_temp_relationship)
snowball.train(corpus='../tests/data/relex/curie_training_set/')

The training process is online. This means that the user can train the system on as many papers as they like, and it will continue to update the knowledge base. For each paper, the sentences are scanned for matches to the parse phrase, and if a sentence matches, candidate relationships are formed. There can be many candidate relationships in a single sentence, so the output presents the user with all available candidates.

To accept a relationship, type the number (or numbers) of the candidates you wish to accept. For example, if you want candidate 0 only, type '0' and press enter. If you want candidates 0 and 3, type '0,3' and press enter. If you don't want any, enter any other input, e.g. 'n' or 'no'.

The training process automatically clusters the sentences you accept and updates the knowledge base. You can check what has been learned by looking in the relex/data folder. You can stop training and start again at any time, or come back to the same training process later by loading an existing Snowball system with Snowball.load().
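The way a response such as '0,3' maps to accepted candidates can be sketched as follows. This is a minimal illustration of the selection rule described above, not the code ChemDataExtractor uses: `parse_selection` is a hypothetical helper, and rejecting the whole response when an index is out of range is an assumption.

```python
def parse_selection(response, n_candidates):
    """Return the candidate indices named in a response such as '0,3'.

    Any response that is not a comma-separated list of valid indices
    (for example 'n' or 'no') selects nothing.
    """
    try:
        chosen = [int(part) for part in response.split(',')]
    except ValueError:
        return []  # non-numeric input means "accept no candidates"
    # Assumption: an out-of-range index invalidates the whole response.
    if any(index < 0 or index >= n_candidates for index in chosen):
        return []
    return chosen


for response in ['0', '0,3', 'n', 'no']:
    print(repr(response), parse_selection(response, n_candidates=4))
```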

Seeing what has been learned

Looking into data/relex/curie_temperatures_patterns.txt, we see what patterns were learned from our training: