Models

You can build your own models or, of course, use pre-trained ones: simply download any of the models available.

knitr::kable(list_models())
| Language | Component | Description | Download | Link |
|----------|-----------|-------------|----------|------|
| da | Tokenizer | Trained on conllx ddt data. | da-token.bin | http://opennlp.sourceforge.net/models-1.5/da-token.bin |
| da | Sentence Detector | Trained on conllx ddt data. | da-sent.bin | http://opennlp.sourceforge.net/models-1.5/da-sent.bin |
| da | POS Tagger | Maxent model trained on conllx ddt data. | da-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/da-pos-maxent.bin |
| da | POS Tagger | Perceptron model trained on conllx ddt data. | da-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/da-pos-perceptron.bin |
| de | Tokenizer | Trained on tiger data. | de-token.bin | http://opennlp.sourceforge.net/models-1.5/de-token.bin |
| de | Sentence Detector | Trained on tiger data. | de-sent.bin | http://opennlp.sourceforge.net/models-1.5/de-sent.bin |
| de | POS Tagger | Maxent model trained on tiger corpus. | de-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/de-pos-maxent.bin |
| de | POS Tagger | Perceptron model trained on tiger corpus. | de-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/de-pos-perceptron.bin |
| en | Tokenizer | Trained on opennlp training data. | en-token.bin | http://opennlp.sourceforge.net/models-1.5/en-token.bin |
| en | Sentence Detector | Trained on opennlp training data. | en-sent.bin | http://opennlp.sourceforge.net/models-1.5/en-sent.bin |
| en | POS Tagger | Maxent model with tag dictionary. | en-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin |
| en | POS Tagger | Perceptron model with tag dictionary. | en-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/en-pos-perceptron.bin |
| en | Name Finder | Date name finder model. | en-ner-date.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-date.bin |
| en | Name Finder | Location name finder model. | en-ner-location.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin |
| en | Name Finder | Money name finder model. | en-ner-money.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-money.bin |
| en | Name Finder | Organization name finder model. | en-ner-organization.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-organization.bin |
| en | Name Finder | Percentage name finder model. | en-ner-percentage.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-percentage.bin |
| en | Name Finder | Person name finder model. | en-ner-person.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin |
| en | Name Finder | Time name finder model. | en-ner-time.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-time.bin |
| en | Chunker | Trained on conll2000 shared task data. | en-chunker.bin | http://opennlp.sourceforge.net/models-1.5/en-chunker.bin |
| en | Parser | | en-parser-chunking.bin | http://opennlp.sourceforge.net/models-1.5/en-parser-chunking.bin |
| en | Coreference | | coref | http://opennlp.sourceforge.net/models-1.5/coref |
| es | Name Finder | Person name finder model. Trained on conll02 shared task data. | es-ner-person.bin | http://opennlp.sourceforge.net/models-1.5/es-ner-person.bin |
| es | Name Finder | Organization name finder model. Trained on conll02 shared task data. | es-ner-organization.bin | http://opennlp.sourceforge.net/models-1.5/es-ner-organization.bin |
| es | Name Finder | Location name finder model. Trained on conll02 shared task data. | es-ner-location.bin | http://opennlp.sourceforge.net/models-1.5/es-ner-location.bin |
| es | Name Finder | Misc name finder model. Trained on conll02 shared task data. | es-ner-misc.bin | http://opennlp.sourceforge.net/models-1.5/es-ner-misc.bin |
| nl | Tokenizer | Trained on conllx alpino data. | nl-token.bin | http://opennlp.sourceforge.net/models-1.5/nl-token.bin |
| nl | Sentence Detector | Trained on conllx alpino data. | nl-sent.bin | http://opennlp.sourceforge.net/models-1.5/nl-sent.bin |
| nl | Name Finder | Person name finder model. Trained on conll02 shared task data. | nl-ner-person.bin | http://opennlp.sourceforge.net/models-1.5/nl-ner-person.bin |
| nl | Name Finder | Organization name finder model. Trained on conll02 shared task data. | nl-ner-organization.bin | http://opennlp.sourceforge.net/models-1.5/nl-ner-organization.bin |
| nl | Name Finder | Location name finder model. Trained on conll02 shared task data. | nl-ner-location.bin | http://opennlp.sourceforge.net/models-1.5/nl-ner-location.bin |
| nl | Name Finder | Misc name finder model. Trained on conll02 shared task data. | nl-ner-misc.bin | http://opennlp.sourceforge.net/models-1.5/nl-ner-misc.bin |
| nl | POS Tagger | Maxent model trained on conllx alpino data. | nl-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/nl-pos-maxent.bin |
| nl | POS Tagger | Perceptron model trained on conllx alpino data. | nl-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/nl-pos-perceptron.bin |
| pt | Tokenizer | Trained on conllx bosque data. | pt-token.bin | http://opennlp.sourceforge.net/models-1.5/pt-token.bin |
| pt | Sentence Detector | Trained on conllx bosque data. | pt-sent.bin | http://opennlp.sourceforge.net/models-1.5/pt-sent.bin |
| pt | POS Tagger | Maxent model trained on conllx bosque data. | pt-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/pt-pos-maxent.bin |
| pt | POS Tagger | Perceptron model trained on conllx bosque data. | pt-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/pt-pos-perceptron.bin |
| se | Tokenizer | Trained on conllx talbanken05 data. | se-token.bin | http://opennlp.sourceforge.net/models-1.5/se-token.bin |
| se | Sentence Detector | Trained on conllx talbanken05 data. | se-sent.bin | http://opennlp.sourceforge.net/models-1.5/se-sent.bin |
| se | POS Tagger | Maxent model trained on conllx talbanken05 data. | se-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/se-pos-maxent.bin |
| se | POS Tagger | Perceptron model trained on conllx talbanken05 data. | se-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/se-pos-perceptron.bin |
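
For example, to download the English tokenizer model listed above (a minimal sketch using base R; the destination path is up to you):

# Download the English tokenizer model into the working directory
url <- "http://opennlp.sourceforge.net/models-1.5/en-token.bin"
download.file(url, destfile = "en-token.bin", mode = "wb")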

Name Tagging

In training data, the closing <END> tag must be separated from any following punctuation by a space:

  • <END>. is invalid
  • <END> . is valid

Use check_tags to make sure your tags are correctly formatted.
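
A quick check might look like this (a hedged sketch: check_tags is assumed here to accept the tagged strings; see the package documentation for its exact signature):

# Hypothetical usage: assuming check_tags() accepts tagged strings
tagged <- "It is often referred to as <START:wef> Davos <END> ."
check_tags(tagged)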

Tagger

The package currently includes a basic tagger, tag_docs, to easily tag training data for training a token name finder (tnf_train).

# Manually tagged string
manual <- paste("This organisation is called the <START:wef> World Economic Forum <END>",
                "It is often referred to as <START:wef> Davos <END> or the <START:wef> WEF <END> .")

# Create the equivalent untagged string
data <- paste("This organisation is called the World Economic Forum",
              "It is often referred to as Davos or the WEF.")

# Tag the string programmatically
auto <- tag_docs(data, "WEF", "wef")
auto <- tag_docs(auto, "World Economic Forum", "wef")
auto <- tag_docs(auto, "Davos", "wef")

identical(manual, auto)
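
The tagged string can then be used to train the token name finder. A hypothetical sketch, with argument names assumed by analogy with dc_train further below (check the package documentation for the actual signature):

# Hypothetical: tnf_train() arguments assumed to mirror dc_train()
wef_model <- tnf_train(model = paste0(getwd(), "/wef.bin"),
                       data = auto, lang = "en")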

Training data

Token name finder

You will need considerable training data for name extraction: around 15’000 sentences. This does not mean 15’000 tagged sentences; it means 15’000 sentences representative of the documents you will extract names from.

Including sentences that do not contain tagged names reduces false positives; the model learns what to extract as much as it learns what not to extract.
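
For instance, a tiny slice of such a training set could mix tagged and untagged sentences (a minimal illustration):

# Tagged sentence: teaches the model what to extract
# Untagged sentence: teaches it what not to extract
train <- c(
  "The <START:wef> World Economic Forum <END> meets every year .",
  "The annual meeting takes place in January ."
)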

Document classifier

To train a decent document classifier you will need around 5’000 classified documents as training data, with a bare minimum of 5 documents per category.

library(decipher)

# Get the working directory: the model argument needs a full path
wd <- getwd()

data <- data.frame(
  class = c("Sport", "Business", "Sport", "Sport", "Business", "Politics", "Politics", "Politics"),
  doc = c("Football, tennis, golf and, bowling and, score.",
          "Marketing, Finance, Legal and, Administration.",
          "Tennis, Ski, Golf and, gym and, match.",
          "football, climbing and gym.",
          "Marketing, Business, Money and, Management.",
          "This document talks politics and Donal Trump.",
          "Donald Trump is the President of the US, sadly.",
          "Article about politics and president Trump.")
)

# Too little data: most events are dropped by the default cutoff of 5
model <- dc_train(model = paste0(wd, "/model.bin"), data = data, lang = "en")
#> Indexing events with TwoPass using cutoff of 5
#> 
#>  Computing event counts...  done. 8 events
#>  Indexing...  Dropped event Sport:[bow=football,, bow=climbing, bow=and, bow=gym.]
#> Dropped event Politics:[bow=This, bow=document, bow=talks, bow=politics, bow=and, bow=Donal, bow=Trump.]
#> Dropped event Politics:[bow=Donald, bow=Trump, bow=is, bow=the, bow=President, bow=of, bow=the, bow=US,, bow=sadly.]
#> Dropped event Politics:[bow=Article, bow=about, bow=politics, bow=and, bow=president, bow=Trump.]
#> done.
#> Sorting and merging events... done. Reduced 4 events to 2.
#> Done indexing in 0.02 s.
#> Incorporating indexed data for training...  
#> done.
#>  Number of Event Tokens: 2
#>      Number of Outcomes: 3
#>    Number of Predicates: 1
#> ...done.
#> Computing model parameters ...
#> Performing 100 iterations.
#>   1:  ... loglikelihood=-4.394449154672439   0.5
#>   2:  ... loglikelihood=-3.8421887157189216  0.5
#>   3:  ... loglikelihood=-3.6154430125057786  0.5
#>   4:  ... loglikelihood=-3.465596155326559   0.5
#>   5:  ... loglikelihood=-3.357105109210417   0.5
#>   6:  ... loglikelihood=-3.274527687900086   0.5
#>   7:  ... loglikelihood=-3.209347253794932   0.5
#>   8:  ... loglikelihood=-3.1564421346077856  0.5
#>   9:  ... loglikelihood=-3.1125409493265437  0.5
#>  10:  ... loglikelihood=-3.075452811722954   0.5
#>  11:  ... loglikelihood=-3.043653058837632   0.5
#>  12:  ... loglikelihood=-3.0160464644888547  0.5
#>  13:  ... loglikelihood=-2.991825031705967   0.5
#>  14:  ... loglikelihood=-2.970378973443104   0.5
#>  15:  ... loglikelihood=-2.951238991461673   0.5
#>  16:  ... loglikelihood=-2.934037698521228   0.5
#>  17:  ... loglikelihood=-2.918483147064785   0.5
#>  18:  ... loglikelihood=-2.904340240566901   0.5
#>  19:  ... loglikelihood=-2.8914174106886295  0.5
#>  20:  ... loglikelihood=-2.879556893058896   0.5
#>  21:  ... loglikelihood=-2.868627512822131   0.5
#>  22:  ... loglikelihood=-2.858519252812046   0.5
#>  23:  ... loglikelihood=-2.8491391089469005  0.5
#>  24:  ... loglikelihood=-2.8404078891499767  0.5
#>  25:  ... loglikelihood=-2.8322577133850033  0.5
#>  26:  ... loglikelihood=-2.8246300412382315  0.5
#>  27:  ... loglikelihood=-2.8174741010412983  0.5
#>  28:  ... loglikelihood=-2.8107456278868233  0.5
#>  29:  ... loglikelihood=-2.8044058416105955  0.5
#>  30:  ... loglikelihood=-2.798420612901019   0.5
#>  31:  ... loglikelihood=-2.7927597781512117  0.5
#>  32:  ... loglikelihood=-2.787396572848336   0.5
#>  33:  ... loglikelihood=-2.7823071601298786  0.5
#>  34:  ... loglikelihood=-2.777470236275584   0.5
#>  35:  ... loglikelihood=-2.7728666988024067  0.5
#>  36:  ... loglikelihood=-2.7684793658127664  0.5
#>  37:  ... loglikelihood=-2.7642927375468878  0.5
#>  38:  ... loglikelihood=-2.760292792877533   0.5
#>  39:  ... loglikelihood=-2.7564668148843223  0.5
#>  40:  ... loglikelihood=-2.7528032407468404  0.5
#>  41:  ... loglikelihood=-2.74929153206948    0.5
#>  42:  ... loglikelihood=-2.7459220624478426  0.5
#>  43:  ... loglikelihood=-2.742686019645536   0.5
#>  44:  ... loglikelihood=-2.7395753202010997  0.5
#>  45:  ... loglikelihood=-2.7365825346502968  0.5
#>  46:  ... loglikelihood=-2.7337008218468335  0.5
#>  47:  ... loglikelihood=-2.730923871108338   0.5
#>  48:  ... loglikelihood=-2.72824585111486    0.5
#>  49:  ... loglikelihood=-2.7256613646526846  0.5
#>  50:  ... loglikelihood=-2.723165408433481   0.5
#>  51:  ... loglikelihood=-2.720753337333093   0.5
#>  52:  ... loglikelihood=-2.7184208324896924  0.5
#>  53:  ... loglikelihood=-2.7161638727811193  0.5
#>  54:  ... loglikelihood=-2.7139787092686083  0.5
#>  55:  ... loglikelihood=-2.711861842250955   0.5
#>  56:  ... loglikelihood=-2.709810000621419   0.5
#>  57:  ... loglikelihood=-2.7078201232605617  0.5
#>  58:  ... loglikelihood=-2.7058893422331587  0.5
#>  59:  ... loglikelihood=-2.7040149675871263  0.5
#>  60:  ... loglikelihood=-2.702194473577981   0.5
#>  61:  ... loglikelihood=-2.7004254861643284  0.5
#>  62:  ... loglikelihood=-2.6987057716388008  0.5
#>  63:  ... loglikelihood=-2.6970332262752077  0.5
#>  64:  ... loglikelihood=-2.695405866886859   0.5
#>  65:  ... loglikelihood=-2.6938218222032515  0.5
#>  66:  ... loglikelihood=-2.692279324983063   0.5
#>  67:  ... loglikelihood=-2.6907767047906637  0.5
#>  68:  ... loglikelihood=-2.68931238137153    0.5
#>  69:  ... loglikelihood=-2.6878848585690607  0.5
#>  70:  ... loglikelihood=-2.6864927187315444  0.5
#>  71:  ... loglikelihood=-2.6851346175635373  0.5
#>  72:  ... loglikelihood=-2.6838092793807213  0.5
#>  73:  ... loglikelihood=-2.682515492731618   0.5
#>  74:  ... loglikelihood=-2.681252106353268   0.5
#>  75:  ... loglikelihood=-2.6800180254313566  0.5
#>  76:  ... loglikelihood=-2.6788122081382024  0.5
#>  77:  ... loglikelihood=-2.6776336624246775  0.5
#>  78:  ... loglikelihood=-2.6764814430444503  0.5
#>  79:  ... loglikelihood=-2.6753546487910365  0.5
#>  80:  ... loglikelihood=-2.6742524199299984  0.5
#>  81:  ... loglikelihood=-2.673173935810302   0.5
#>  82:  ... loglikelihood=-2.672118412640326   0.5
#>  83:  ... loglikelihood=-2.6710851014153434  0.5
#>  84:  ... loglikelihood=-2.670073285984502   0.5
#>  85:  ... loglikelihood=-2.669082281246406   0.5
#>  86:  ... loglikelihood=-2.6681114314633576  0.5
#>  87:  ... loglikelihood=-2.6671601086852044  0.5
#>  88:  ... loglikelihood=-2.666227711274505   0.5
#>  89:  ... loglikelihood=-2.6653136625254596  0.5
#>  90:  ... loglikelihood=-2.6644174093696673  0.5
#>  91:  ... loglikelihood=-2.663538421162381   0.5
#>  92:  ... loglikelihood=-2.6626761885434305  0.5
#>  93:  ... loglikelihood=-2.6618302223674837  0.5
#>  94:  ... loglikelihood=-2.661000052698734   0.5
#>  95:  ... loglikelihood=-2.6601852278655103  0.5
#>  96:  ... loglikelihood=-2.6593853135706516  0.5
#>  97:  ... loglikelihood=-2.6585998920538243  0.5
#>  98:  ... loglikelihood=-2.657828561302261   0.5
#>  99:  ... loglikelihood=-2.657070934306653   0.5
#> 100:  ... loglikelihood=-2.656326638359206   0.5

# Pause briefly before retraining
Sys.sleep(15)

# repeat data 50 times
# Obviously do not do that in the real world
data <- do.call("rbind", replicate(50, data[sample(nrow(data), 4),],
                                   simplify = FALSE))

# train model
model <- dc_train(model = paste0(wd, "/model.bin"), data = data, lang = "en")
#> Indexing events with TwoPass using cutoff of 5
#> 
#>  Computing event counts...  done. 200 events
#>  Indexing...  done.
#> Sorting and merging events... done. Reduced 200 events to 8.
#> Done indexing in 0.04 s.
#> Incorporating indexed data for training...  
#> done.
#>  Number of Event Tokens: 8
#>      Number of Outcomes: 3
#>    Number of Predicates: 39
#> ...done.
#> Computing model parameters ...
#> Performing 100 iterations.
#>   1:  ... loglikelihood=-219.72245773362195  0.335
#>   2:  ... loglikelihood=-142.88196406072186  1.0
#>   3:  ... loglikelihood=-106.34647714324419  1.0
#>   4:  ... loglikelihood=-84.4755805526073    1.0
#>   5:  ... loglikelihood=-69.8854447064422    1.0
#>   6:  ... loglikelihood=-59.478523742925404  1.0
#>   7:  ... loglikelihood=-51.69879250162104   1.0
#>   8:  ... loglikelihood=-45.67441939631744   1.0
#>   9:  ... loglikelihood=-40.87873008584082   1.0
#>  10:  ... loglikelihood=-36.9751384439237    1.0
#>  11:  ... loglikelihood=-33.73879993427859   1.0
#>  12:  ... loglikelihood=-31.01400451535654   1.0
#>  13:  ... loglikelihood=-28.68963344487269   1.0
#>  14:  ... loglikelihood=-26.684345273926482  1.0
#>  15:  ... loglikelihood=-24.937278865062638  1.0
#>  16:  ... loglikelihood=-23.402022062667086  1.0
#>  17:  ... loglikelihood=-22.04258525730783   1.0
#>  18:  ... loglikelihood=-20.830645280266495  1.0
#>  19:  ... loglikelihood=-19.743616728368874  1.0
#>  20:  ... loglikelihood=-18.7632755449706    1.0
#>  21:  ... loglikelihood=-17.8747592947954    1.0
#>  22:  ... loglikelihood=-17.065829440884418  1.0
#>  23:  ... loglikelihood=-16.326319087680915  1.0
#>  24:  ... loglikelihood=-15.647714125489163  1.0
#>  25:  ... loglikelihood=-15.02283173465646   1.0
#>  26:  ... loglikelihood=-14.445570898905896  1.0
#>  27:  ... loglikelihood=-13.91071683455445   1.0
#>  28:  ... loglikelihood=-13.413786247233787  1.0
#>  29:  ... loglikelihood=-12.950903829891859  1.0
#>  30:  ... loglikelihood=-12.518702899695455  1.0
#>  31:  ... loglikelihood=-12.114244855196077  1.0
#>  32:  ... loglikelihood=-11.734953431048229  1.0
#>  33:  ... loglikelihood=-11.378560679331837  1.0
#>  34:  ... loglikelihood=-11.043062312631536  1.0
#>  35:  ... loglikelihood=-10.726680572862678  1.0
#>  36:  ... loglikelihood=-10.427833189441525  1.0
#>  37:  ... loglikelihood=-10.145107294894169  1.0
#>  38:  ... loglikelihood=-9.877237399853936   1.0
#>  39:  ... loglikelihood=-9.623086710342822   1.0
#>  40:  ... loglikelihood=-9.381631211225155   1.0
#>  41:  ... loglikelihood=-9.151946050317257   1.0
#>  42:  ... loglikelihood=-8.933193844939263   1.0
#>  43:  ... loglikelihood=-8.724614602022037   1.0
#>  44:  ... loglikelihood=-8.525516998252673   1.0
#>  45:  ... loglikelihood=-8.335270811202298   1.0
#>  46:  ... loglikelihood=-8.153300328268987   1.0
#>  47:  ... loglikelihood=-7.979078589378311   1.0
#>  48:  ... loglikelihood=-7.812122343109141   1.0
#>  49:  ... loglikelihood=-7.651987615334983   1.0
#>  50:  ... loglikelihood=-7.498265805440456   1.0
#>  51:  ... loglikelihood=-7.3505802383570025  1.0
#>  52:  ... loglikelihood=-7.208583111589921   1.0
#>  53:  ... loglikelihood=-7.071952785502366   1.0
#>  54:  ... loglikelihood=-6.940391372714274   1.0
#>  55:  ... loglikelihood=-6.813622588838412   1.0
#>  56:  ... loglikelihood=-6.691389832125446   1.0
#>  57:  ... loglikelihood=-6.573454464104223   1.0
#>  58:  ... loglikelihood=-6.459594267122684   1.0
#>  59:  ... loglikelihood=-6.349602057937263   1.0
#>  60:  ... loglikelihood=-6.243284439257992   1.0
#>  61:  ... loglikelihood=-6.140460673512784   1.0
#>  62:  ... loglikelihood=-6.040961665111191   1.0
#>  63:  ... loglikelihood=-5.944629039218141   1.0
#>  64:  ... loglikelihood=-5.851314306537801   1.0
#>  65:  ... loglikelihood=-5.760878104891887   1.0
#>  66:  ... loglikelihood=-5.673189509487204   1.0
#>  67:  ... loglikelihood=-5.588125404729695   1.0
#>  68:  ... loglikelihood=-5.505569911277434   1.0
#>  69:  ... loglikelihood=-5.425413862753372   1.0
#>  70:  ... loglikelihood=-5.347554327171909   1.0
#>  71:  ... loglikelihood=-5.2718941686889345  1.0
#>  72:  ... loglikelihood=-5.19834164576965    1.0
#>  73:  ... loglikelihood=-5.1268100422952125  1.0
#>  74:  ... loglikelihood=-5.057217328503126   1.0
#>  75:  ... loglikelihood=-4.989485848986482   1.0
#>  76:  ... loglikelihood=-4.923542035267944   1.0
#>  77:  ... loglikelihood=-4.859316140721235   1.0
#>  78:  ... loglikelihood=-4.796741995840829   1.0
#>  79:  ... loglikelihood=-4.73575678206189    1.0
#>  80:  ... loglikelihood=-4.676300822511777   1.0
#>  81:  ... loglikelihood=-4.618317388233948   1.0
#>  82:  ... loglikelihood=-4.561752518566649   1.0
#>  83:  ... loglikelihood=-4.506554854485603   1.0
#>  84:  ... loglikelihood=-4.452675483832905   1.0
#>  85:  ... loglikelihood=-4.4000677974554 1.0
#>  86:  ... loglikelihood=-4.348687355366514   1.0
#>  87:  ... loglikelihood=-4.298491762126714   1.0
#>  88:  ... loglikelihood=-4.24944055071064    1.0
#>  89:  ... loglikelihood=-4.201495074195  1.0
#>  90:  ... loglikelihood=-4.154618404659525   1.0
#>  91:  ... loglikelihood=-4.108775238747462   1.0
#>  92:  ... loglikelihood=-4.063931809379816   1.0
#>  93:  ... loglikelihood=-4.0200558031604725  1.0
#>  94:  ... loglikelihood=-3.977116283049374   1.0
#>  95:  ... loglikelihood=-3.9350836159160716  1.0
#>  96:  ... loglikelihood=-3.893929404618151   1.0
#>  97:  ... loglikelihood=-3.853626424278687   1.0
#>  98:  ... loglikelihood=-3.8141485624627203  1.0
#>  99:  ... loglikelihood=-3.7754707629779007  1.0
#> 100:  ... loglikelihood=-3.737568973045476   1.0
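
With enough data no events are dropped and the model trains cleanly. You can then use it to classify new documents; a hedged sketch, assuming the package exposes a dc() function that takes the trained model and a document (check the package documentation):

# Hypothetical usage: classify a new document with the trained model
dc(model, "Tennis and golf scores.")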