Models

You can build your own models or, of course, use pre-trained ones: simply download any of the models available.

knitr::kable(list_models())
| Language | Component | Description | Download | Link |
|----------|-----------|-------------|----------|------|
| da | Tokenizer | Trained on conllx ddt data. | da-token.bin | http://opennlp.sourceforge.net/models-1.5/da-token.bin |
| da | Sentence Detector | Trained on conllx ddt data. | da-sent.bin | http://opennlp.sourceforge.net/models-1.5/da-sent.bin |
| da | POS Tagger | Maxent model trained on conllx ddt data. | da-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/da-pos-maxent.bin |
| da | POS Tagger | Perceptron model trained on conllx ddt data. | da-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/da-pos-perceptron.bin |
| de | Tokenizer | Trained on tiger data. | de-token.bin | http://opennlp.sourceforge.net/models-1.5/de-token.bin |
| de | Sentence Detector | Trained on tiger data. | de-sent.bin | http://opennlp.sourceforge.net/models-1.5/de-sent.bin |
| de | POS Tagger | Maxent model trained on tiger corpus. | de-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/de-pos-maxent.bin |
| de | POS Tagger | Perceptron model trained on tiger corpus. | de-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/de-pos-perceptron.bin |
| en | Tokenizer | Trained on opennlp training data. | en-token.bin | http://opennlp.sourceforge.net/models-1.5/en-token.bin |
| en | Sentence Detector | Trained on opennlp training data. | en-sent.bin | http://opennlp.sourceforge.net/models-1.5/en-sent.bin |
| en | POS Tagger | Maxent model with tag dictionary. | en-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin |
| en | POS Tagger | Perceptron model with tag dictionary. | en-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/en-pos-perceptron.bin |
| en | Name Finder | Date name finder model. | en-ner-date.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-date.bin |
| en | Name Finder | Location name finder model. | en-ner-location.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin |
| en | Name Finder | Money name finder model. | en-ner-money.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-money.bin |
| en | Name Finder | Organization name finder model. | en-ner-organization.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-organization.bin |
| en | Name Finder | Percentage name finder model. | en-ner-percentage.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-percentage.bin |
| en | Name Finder | Person name finder model. | en-ner-person.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin |
| en | Name Finder | Time name finder model. | en-ner-time.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-time.bin |
| en | Chunker | Trained on conll2000 shared task data. | en-chunker.bin | http://opennlp.sourceforge.net/models-1.5/en-chunker.bin |
| en | Parser | | en-parser-chunking.bin | http://opennlp.sourceforge.net/models-1.5/en-parser-chunking.bin |
| en | Coreference | | coref | http://opennlp.sourceforge.net/models-1.5/coref |
| es | Name Finder | Person name finder model. Trained on conll02 shared task data. | es-ner-person.bin | http://opennlp.sourceforge.net/models-1.5/es-ner-person.bin |
| es | Name Finder | Organization name finder model. Trained on conll02 shared task data. | es-ner-organization.bin | http://opennlp.sourceforge.net/models-1.5/es-ner-organization.bin |
| es | Name Finder | Location name finder model. Trained on conll02 shared task data. | es-ner-location.bin | http://opennlp.sourceforge.net/models-1.5/es-ner-location.bin |
| es | Name Finder | Misc name finder model. Trained on conll02 shared task data. | es-ner-misc.bin | http://opennlp.sourceforge.net/models-1.5/es-ner-misc.bin |
| nl | Tokenizer | Trained on conllx alpino data. | nl-token.bin | http://opennlp.sourceforge.net/models-1.5/nl-token.bin |
| nl | Sentence Detector | Trained on conllx alpino data. | nl-sent.bin | http://opennlp.sourceforge.net/models-1.5/nl-sent.bin |
| nl | Name Finder | Person name finder model. Trained on conll02 shared task data. | nl-ner-person.bin | http://opennlp.sourceforge.net/models-1.5/nl-ner-person.bin |
| nl | Name Finder | Organization name finder model. Trained on conll02 shared task data. | nl-ner-organization.bin | http://opennlp.sourceforge.net/models-1.5/nl-ner-organization.bin |
| nl | Name Finder | Location name finder model. Trained on conll02 shared task data. | nl-ner-location.bin | http://opennlp.sourceforge.net/models-1.5/nl-ner-location.bin |
| nl | Name Finder | Misc name finder model. Trained on conll02 shared task data. | nl-ner-misc.bin | http://opennlp.sourceforge.net/models-1.5/nl-ner-misc.bin |
| nl | POS Tagger | Maxent model trained on conllx alpino data. | nl-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/nl-pos-maxent.bin |
| nl | POS Tagger | Perceptron model trained on conllx alpino data. | nl-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/nl-pos-perceptron.bin |
| pt | Tokenizer | Trained on conllx bosque data. | pt-token.bin | http://opennlp.sourceforge.net/models-1.5/pt-token.bin |
| pt | Sentence Detector | Trained on conllx bosque data. | pt-sent.bin | http://opennlp.sourceforge.net/models-1.5/pt-sent.bin |
| pt | POS Tagger | Maxent model trained on conllx bosque data. | pt-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/pt-pos-maxent.bin |
| pt | POS Tagger | Perceptron model trained on conllx bosque data. | pt-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/pt-pos-perceptron.bin |
| se | Tokenizer | Trained on conllx talbanken05 data. | se-token.bin | http://opennlp.sourceforge.net/models-1.5/se-token.bin |
| se | Sentence Detector | Trained on conllx talbanken05 data. | se-sent.bin | http://opennlp.sourceforge.net/models-1.5/se-sent.bin |
| se | POS Tagger | Maxent model trained on conllx talbanken05 data. | se-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/se-pos-maxent.bin |
| se | POS Tagger | Perceptron model trained on conllx talbanken05 data. | se-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/se-pos-perceptron.bin |
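
For example, to download the English tokenizer model listed above (a minimal sketch using base R; the destination path is up to you):

# Download the English tokenizer model into the working directory
url <- "http://opennlp.sourceforge.net/models-1.5/en-token.bin"
download.file(url, destfile = "en-token.bin", mode = "wb")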

Name Tagging

In training data, the closing <END> tag must be separated from any following punctuation by a space:

  • <END>. is invalid
  • <END> . is valid

Use check_tags to make sure your tags are correctly formatted.
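
A quick check might look like this (a hedged sketch: check_tags is assumed here to accept the tagged strings; see the package documentation for its exact signature):

# Hypothetical usage: assuming check_tags() accepts tagged strings
tagged <- "It is often referred to as <START:wef> Davos <END> ."
check_tags(tagged)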

Tagger

The package currently includes a basic tagger, tag_docs, to easily tag training data for training a token name finder (tnf_train).

# Manually tagged string
manual <- paste("This organisation is called the <START:wef> World Economic Forum <END>",
                "It is often referred to as <START:wef> Davos <END> or the <START:wef> WEF <END> .")

# Create the equivalent untagged string
data <- paste("This organisation is called the World Economic Forum",
              "It is often referred to as Davos or the WEF.")

# Tag the string programmatically
auto <- tag_docs(data, "WEF", "wef")
auto <- tag_docs(auto, "World Economic Forum", "wef")
auto <- tag_docs(auto, "Davos", "wef")

identical(manual, auto)
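
The tagged string can then be used to train the token name finder. A hypothetical sketch, with argument names assumed by analogy with dc_train further below (check the package documentation for the actual signature):

# Hypothetical: tnf_train() arguments assumed to mirror dc_train()
wef_model <- tnf_train(model = paste0(getwd(), "/wef.bin"),
                       data = auto, lang = "en")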

Training data

Token name finder

You will need considerable training data for name extraction: around 15’000 sentences. This does not mean 15’000 tagged sentences; it means 15’000 sentences representative of the documents you will extract names from.

Including sentences that do not contain tagged names reduces false positives; the model learns what to extract as much as it learns what not to extract.
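
For instance, a tiny slice of such a training set could mix tagged and untagged sentences (a minimal illustration):

# Tagged sentence: teaches the model what to extract
# Untagged sentence: teaches it what not to extract
train <- c(
  "The <START:wef> World Economic Forum <END> meets every year .",
  "The annual meeting takes place in January ."
)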

Document classifier

To train a decent document classifier you will need around 5’000 classified documents as training data, with a bare minimum of 5 documents per category.

library(decipher)

# Get the working directory: the model argument needs a full path
wd <- getwd()

data <- data.frame(
  class = c("Sport", "Business", "Sport", "Sport", "Business", "Politics", "Politics", "Politics"),
  doc = c("Football, tennis, golf and, bowling and, score.",
          "Marketing, Finance, Legal and, Administration.",
          "Tennis, Ski, Golf and, gym and, match.",
          "football, climbing and gym.",
          "Marketing, Business, Money and, Management.",
          "This document talks politics and Donal Trump.",
          "Donald Trump is the President of the US, sadly.",
          "Article about politics and president Trump.")
)

# Too little data: most events are dropped by the default cutoff of 5
model <- dc_train(model = paste0(wd, "/model.bin"), data = data, lang = "en")
#> Indexing events with TwoPass using cutoff of 5
#> 
#>  Computing event counts...  done. 8 events
#>  Indexing...  Dropped event Sport:[bow=football,, bow=climbing, bow=and, bow=gym.]
#> Dropped event Politics:[bow=This, bow=document, bow=talks, bow=politics, bow=and, bow=Donal, bow=Trump.]
#> Dropped event Politics:[bow=Donald, bow=Trump, bow=is, bow=the, bow=President, bow=of, bow=the, bow=US,, bow=sadly.]
#> Dropped event Politics:[bow=Article, bow=about, bow=politics, bow=and, bow=president, bow=Trump.]
#> done.
#> Sorting and merging events... done. Reduced 4 events to 2.
#> Done indexing in 0.02 s.
#> Incorporating indexed data for training...  
#> done.
#>  Number of Event Tokens: 2
#>      Number of Outcomes: 3
#>    Number of Predicates: 1
#> ...done.
#> Computing model parameters ...
#> Performing 100 iterations.
#>   1:  ... loglikelihood=-4.394449154672439   0.5
#>   2:  ... loglikelihood=-3.8421887157189216  0.5
#>   3:  ... loglikelihood=-3.6154430125057786  0.5
#>   4:  ... loglikelihood=-3.465596155326559   0.5
#>   5:  ... loglikelihood=-3.357105109210417   0.5
#>   6:  ... loglikelihood=-3.274527687900086   0.5
#>   7:  ... loglikelihood=-3.209347253794932   0.5
#>   8:  ... loglikelihood=-3.1564421346077856  0.5
#>   9:  ... loglikelihood=-3.1125409493265437  0.5
#>  10:  ... loglikelihood=-3.075452811722954   0.5
#>  11:  ... loglikelihood=-3.043653058837632   0.5
#>  12:  ... loglikelihood=-3.0160464644888547  0.5
#>  13:  ... loglikelihood=-2.991825031705967   0.5
#>  14:  ... loglikelihood=-2.970378973443104   0.5
#>  15:  ... loglikelihood=-2.951238991461673   0.5
#>  16:  ... loglikelihood=-2.934037698521228   0.5
#>  17:  ... loglikelihood=-2.918483147064785   0.5
#>  18:  ... loglikelihood=-2.904340240566901   0.5
#>  19:  ... loglikelihood=-2.8914174106886295  0.5
#>  20:  ... loglikelihood=-2.879556893058896   0.5
#>  21:  ... loglikelihood=-2.868627512822131   0.5
#>  22:  ... loglikelihood=-2.858519252812046   0.5
#>  23:  ... loglikelihood=-2.8491391089469005  0.5
#>  24:  ... loglikelihood=-2.8404078891499767  0.5
#>  25:  ... loglikelihood=-2.8322577133850033  0.5
#>  26:  ... loglikelihood=-2.8246300412382315  0.5
#>  27:  ... loglikelihood=-2.8174741010412983  0.5
#>  28:  ... loglikelihood=-2.8107456278868233  0.5
#>  29:  ... loglikelihood=-2.8044058416105955  0.5
#>  30:  ... loglikelihood=-2.798420612901019   0.5
#>  31:  ... loglikelihood=-2.7927597781512117  0.5
#>  32:  ... loglikelihood=-2.787396572848336   0.5
#>  33:  ... loglikelihood=-2.7823071601298786  0.5
#>  34:  ... loglikelihood=-2.777470236275584   0.5
#>  35:  ... loglikelihood=-2.7728666988024067  0.5
#>  36:  ... loglikelihood=-2.7684793658127664  0.5
#>  37:  ... loglikelihood=-2.7642927375468878  0.5
#>  38:  ... loglikelihood=-2.760292792877533   0.5
#>  39:  ... loglikelihood=-2.7564668148843223  0.5
#>  40:  ... loglikelihood=-2.7528032407468404  0.5
#>  41:  ... loglikelihood=-2.74929153206948    0.5
#>  42:  ... loglikelihood=-2.7459220624478426  0.5
#>  43:  ... loglikelihood=-2.742686019645536   0.5
#>  44:  ... loglikelihood=-2.7395753202010997  0.5
#>  45:  ... loglikelihood=-2.7365825346502968  0.5
#>  46:  ... loglikelihood=-2.7337008218468335  0.5
#>  47:  ... loglikelihood=-2.730923871108338   0.5
#>  48:  ... loglikelihood=-2.72824585111486    0.5
#>  49:  ... loglikelihood=-2.7256613646526846  0.5
#>  50:  ... loglikelihood=-2.723165408433481   0.5
#>  51:  ... loglikelihood=-2.720753337333093   0.5
#>  52:  ... loglikelihood=-2.7184208324896924  0.5
#>  53:  ... loglikelihood=-2.7161638727811193  0.5
#>  54:  ... loglikelihood=-2.7139787092686083  0.5
#>  55:  ... loglikelihood=-2.711861842250955   0.5
#>  56:  ... loglikelihood=-2.709810000621419   0.5
#>  57:  ... loglikelihood=-2.7078201232605617  0.5
#>  58:  ... loglikelihood=-2.7058893422331587  0.5
#>  59:  ... loglikelihood=-2.7040149675871263  0.5
#>  60:  ... loglikelihood=-2.702194473577981   0.5
#>  61:  ... loglikelihood=-2.7004254861643284  0.5
#>  62:  ... loglikelihood=-2.6987057716388008  0.5
#>  63:  ... loglikelihood=-2.6970332262752077  0.5
#>  64:  ... loglikelihood=-2.695405866886859   0.5
#>  65:  ... loglikelihood=-2.6938218222032515  0.5
#>  66:  ... loglikelihood=-2.692279324983063   0.5
#>  67:  ... loglikelihood=-2.6907767047906637  0.5
#>  68:  ... loglikelihood=-2.68931238137153    0.5
#>  69:  ... loglikelihood=-2.6878848585690607  0.5
#>  70:  ... loglikelihood=-2.6864927187315444  0.5
#>  71:  ... loglikelihood=-2.6851346175635373  0.5
#>  72:  ... loglikelihood=-2.6838092793807213  0.5
#>  73:  ... loglikelihood=-2.682515492731618   0.5
#>  74:  ... loglikelihood=-2.681252106353268   0.5
#>  75:  ... loglikelihood=-2.6800180254313566  0.5
#>  76:  ... loglikelihood=-2.6788122081382024  0.5
#>  77:  ... loglikelihood=-2.6776336624246775  0.5
#>  78:  ... loglikelihood=-2.6764814430444503  0.5
#>  79:  ... loglikelihood=-2.6753546487910365  0.5
#>  80:  ... loglikelihood=-2.6742524199299984  0.5
#>  81:  ... loglikelihood=-2.673173935810302   0.5
#>  82:  ... loglikelihood=-2.672118412640326   0.5
#>  83:  ... loglikelihood=-2.6710851014153434  0.5
#>  84:  ... loglikelihood=-2.670073285984502   0.5
#>  85:  ... loglikelihood=-2.669082281246406   0.5
#>  86:  ... loglikelihood=-2.6681114314633576  0.5
#>  87:  ... loglikelihood=-2.6671601086852044  0.5
#>  88:  ... loglikelihood=-2.666227711274505   0.5
#>  89:  ... loglikelihood=-2.6653136625254596  0.5
#>  90:  ... loglikelihood=-2.6644174093696673  0.5
#>  91:  ... loglikelihood=-2.663538421162381   0.5
#>  92:  ... loglikelihood=-2.6626761885434305  0.5
#>  93:  ... loglikelihood=-2.6618302223674837  0.5
#>  94:  ... loglikelihood=-2.661000052698734   0.5
#>  95:  ... loglikelihood=-2.6601852278655103  0.5
#>  96:  ... loglikelihood=-2.6593853135706516  0.5
#>  97:  ... loglikelihood=-2.6585998920538243  0.5
#>  98:  ... loglikelihood=-2.657828561302261   0.5
#>  99:  ... loglikelihood=-2.657070934306653   0.5
#> 100:  ... loglikelihood=-2.656326638359206   0.5

# Pause briefly before retraining
Sys.sleep(15)

# repeat data 50 times
# Obviously do not do that in the real world
data <- do.call("rbind", replicate(50, data[sample(nrow(data), 4),],
                                   simplify = FALSE))

# train model
model <- dc_train(model = paste0(wd, "/model.bin"), data = data, lang = "en")
#> Indexing events with TwoPass using cutoff of 5
#> 
#>  Computing event counts...  done. 200 events
#>  Indexing...  done.
#> Sorting and merging events... done. Reduced 200 events to 8.
#> Done indexing in 0.04 s.
#> Incorporating indexed data for training...  
#> done.
#>  Number of Event Tokens: 8
#>      Number of Outcomes: 3
#>    Number of Predicates: 39
#> ...done.
#> Computing model parameters ...
#> Performing 100 iterations.
#>   1:  ... loglikelihood=-219.72245773362195  0.335
#>   2:  ... loglikelihood=-142.88196406072186  1.0
#>   3:  ... loglikelihood=-106.34647714324419  1.0
#>   4:  ... loglikelihood=-84.4755805526073    1.0
#>   5:  ... loglikelihood=-69.8854447064422    1.0
#>   6:  ... loglikelihood=-59.478523742925404  1.0
#>   7:  ... loglikelihood=-51.69879250162104   1.0
#>   8:  ... loglikelihood=-45.67441939631744   1.0
#>   9:  ... loglikelihood=-40.87873008584082   1.0
#>  10:  ... loglikelihood=-36.9751384439237    1.0
#>  11:  ... loglikelihood=-33.73879993427859   1.0
#>  12:  ... loglikelihood=-31.01400451535654   1.0
#>  13:  ... loglikelihood=-28.68963344487269   1.0
#>  14:  ... loglikelihood=-26.684345273926482  1.0
#>  15:  ... loglikelihood=-24.937278865062638  1.0
#>  16:  ... loglikelihood=-23.402022062667086  1.0
#>  17:  ... loglikelihood=-22.04258525730783   1.0
#>  18:  ... loglikelihood=-20.830645280266495  1.0
#>  19:  ... loglikelihood=-19.743616728368874  1.0
#>  20:  ... loglikelihood=-18.7632755449706    1.0
#>  21:  ... loglikelihood=-17.8747592947954    1.0
#>  22:  ... loglikelihood=-17.065829440884418  1.0
#>  23:  ... loglikelihood=-16.326319087680915  1.0
#>  24:  ... loglikelihood=-15.647714125489163  1.0
#>  25:  ... loglikelihood=-15.02283173465646   1.0
#>  26:  ... loglikelihood=-14.445570898905896  1.0
#>  27:  ... loglikelihood=-13.91071683455445   1.0
#>  28:  ... loglikelihood=-13.413786247233787  1.0
#>  29:  ... loglikelihood=-12.950903829891859  1.0
#>  30:  ... loglikelihood=-12.518702899695455  1.0
#>  31:  ... loglikelihood=-12.114244855196077  1.0
#>  32:  ... loglikelihood=-11.734953431048229  1.0
#>  33:  ... loglikelihood=-11.378560679331837  1.0
#>  34:  ... loglikelihood=-11.043062312631536  1.0
#>  35:  ... loglikelihood=-10.726680572862678  1.0
#>  36:  ... loglikelihood=-10.427833189441525  1.0
#>  37:  ... loglikelihood=-10.145107294894169  1.0
#>  38:  ... loglikelihood=-9.877237399853936   1.0
#>  39:  ... loglikelihood=-9.623086710342822   1.0
#>  40:  ... loglikelihood=-9.381631211225155   1.0
#>  41:  ... loglikelihood=-9.151946050317257   1.0
#>  42:  ... loglikelihood=-8.933193844939263   1.0
#>  43:  ... loglikelihood=-8.724614602022037   1.0
#>  44:  ... loglikelihood=-8.525516998252673   1.0
#>  45:  ... loglikelihood=-8.335270811202298   1.0
#>  46:  ... loglikelihood=-8.153300328268987   1.0
#>  47:  ... loglikelihood=-7.979078589378311   1.0
#>  48:  ... loglikelihood=-7.812122343109141   1.0
#>  49:  ... loglikelihood=-7.651987615334983   1.0
#>  50:  ... loglikelihood=-7.498265805440456   1.0
#>  51:  ... loglikelihood=-7.3505802383570025  1.0
#>  52:  ... loglikelihood=-7.208583111589921   1.0
#>  53:  ... loglikelihood=-7.071952785502366   1.0
#>  54:  ... loglikelihood=-6.940391372714274   1.0
#>  55:  ... loglikelihood=-6.813622588838412   1.0
#>  56:  ... loglikelihood=-6.691389832125446   1.0
#>  57:  ... loglikelihood=-6.573454464104223   1.0
#>  58:  ... loglikelihood=-6.459594267122684   1.0
#>  59:  ... loglikelihood=-6.349602057937263   1.0
#>  60:  ... loglikelihood=-6.243284439257992   1.0
#>  61:  ... loglikelihood=-6.140460673512784   1.0
#>  62:  ... loglikelihood=-6.040961665111191   1.0
#>  63:  ... loglikelihood=-5.944629039218141   1.0
#>  64:  ... loglikelihood=-5.851314306537801   1.0
#>  65:  ... loglikelihood=-5.760878104891887   1.0
#>  66:  ... loglikelihood=-5.673189509487204   1.0
#>  67:  ... loglikelihood=-5.588125404729695   1.0
#>  68:  ... loglikelihood=-5.505569911277434   1.0
#>  69:  ... loglikelihood=-5.425413862753372   1.0
#>  70:  ... loglikelihood=-5.347554327171909   1.0
#>  71:  ... loglikelihood=-5.2718941686889345  1.0
#>  72:  ... loglikelihood=-5.19834164576965    1.0
#>  73:  ... loglikelihood=-5.1268100422952125  1.0
#>  74:  ... loglikelihood=-5.057217328503126   1.0
#>  75:  ... loglikelihood=-4.989485848986482   1.0
#>  76:  ... loglikelihood=-4.923542035267944   1.0
#>  77:  ... loglikelihood=-4.859316140721235   1.0
#>  78:  ... loglikelihood=-4.796741995840829   1.0
#>  79:  ... loglikelihood=-4.73575678206189    1.0
#>  80:  ... loglikelihood=-4.676300822511777   1.0
#>  81:  ... loglikelihood=-4.618317388233948   1.0
#>  82:  ... loglikelihood=-4.561752518566649   1.0
#>  83:  ... loglikelihood=-4.506554854485603   1.0
#>  84:  ... loglikelihood=-4.452675483832905   1.0
#>  85:  ... loglikelihood=-4.4000677974554 1.0
#>  86:  ... loglikelihood=-4.348687355366514   1.0
#>  87:  ... loglikelihood=-4.298491762126714   1.0
#>  88:  ... loglikelihood=-4.24944055071064    1.0
#>  89:  ... loglikelihood=-4.201495074195  1.0
#>  90:  ... loglikelihood=-4.154618404659525   1.0
#>  91:  ... loglikelihood=-4.108775238747462   1.0
#>  92:  ... loglikelihood=-4.063931809379816   1.0
#>  93:  ... loglikelihood=-4.0200558031604725  1.0
#>  94:  ... loglikelihood=-3.977116283049374   1.0
#>  95:  ... loglikelihood=-3.9350836159160716  1.0
#>  96:  ... loglikelihood=-3.893929404618151   1.0
#>  97:  ... loglikelihood=-3.853626424278687   1.0
#>  98:  ... loglikelihood=-3.8141485624627203  1.0
#>  99:  ... loglikelihood=-3.7754707629779007  1.0
#> 100:  ... loglikelihood=-3.737568973045476   1.0
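
With enough data no events are dropped and the model trains cleanly. You can then use it to classify new documents; a hedged sketch, assuming the package exposes a dc() function that takes the trained model and a document (check the package documentation):

# Hypothetical usage: classify a new document with the trained model
dc(model, "Tennis and golf scores.")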