usage.Rmd
You can build your own models or, of course, use pre-trained ones; simply download any of the models available.
```r
knitr::kable(list_models())
```
Language | Component | Description | Download | Link |
---|---|---|---|---|
da | Tokenizer | Trained on conllx ddt data. | da-token.bin | http://opennlp.sourceforge.net/models-1.5/da-token.bin |
da | Sentence Detector | Trained on conllx ddt data. | da-sent.bin | http://opennlp.sourceforge.net/models-1.5/da-sent.bin |
da | Part of Speech Tagger | Maxent model trained on conllx ddt data. | da-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/da-pos-maxent.bin |
da | POS Tagger | Perceptron model trained on conllx ddt data. | da-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/da-pos-perceptron.bin |
de | Tokenizer | Trained on tiger data. | de-token.bin | http://opennlp.sourceforge.net/models-1.5/de-token.bin |
de | Sentence Detector | Trained on tiger data. | de-sent.bin | http://opennlp.sourceforge.net/models-1.5/de-sent.bin |
de | POS Tagger | Maxent model trained on tiger corpus. | de-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/de-pos-maxent.bin |
de | POS Tagger | Perceptron model trained on tiger corpus. | de-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/de-pos-perceptron.bin |
en | Tokenizer | Trained on opennlp training data. | en-token.bin | http://opennlp.sourceforge.net/models-1.5/en-token.bin |
en | Sentence Detector | Trained on opennlp training data. | en-sent.bin | http://opennlp.sourceforge.net/models-1.5/en-sent.bin |
en | POS Tagger | Maxent model with tag dictionary. | en-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin |
en | POS Tagger | Perceptron model with tag dictionary. | en-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/en-pos-perceptron.bin |
en | Name Finder | Date name finder model. | en-ner-date.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-date.bin |
en | Name Finder | Location name finder model. | en-ner-location.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin |
en | Name Finder | Money name finder model. | en-ner-money.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-money.bin |
en | Name Finder | Organization name finder model. | en-ner-organization.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-organization.bin |
en | Name Finder | Percentage name finder model. | en-ner-percentage.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-percentage.bin |
en | Name Finder | Person name finder model. | en-ner-person.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin |
en | Name Finder | Time name finder model. | en-ner-time.bin | http://opennlp.sourceforge.net/models-1.5/en-ner-time.bin |
en | Chunker | Trained on conll2000 shared task data. | en-chunker.bin | http://opennlp.sourceforge.net/models-1.5/en-chunker.bin |
en | Parser | | en-parser-chunking.bin | http://opennlp.sourceforge.net/models-1.5/en-parser-chunking.bin |
en | Coreference | | coref | http://opennlp.sourceforge.net/models-1.5/coref |
es | Name Finder | Person name finder model. Trained on conll02 shared task data. | es-ner-person.bin | http://opennlp.sourceforge.net/models-1.5/es-ner-person.bin |
es | Name Finder | Organization name finder model. Trained on conll02 shared task data. | es-ner-organization.bin | http://opennlp.sourceforge.net/models-1.5/es-ner-organization.bin |
es | Name Finder | Location name finder model. Trained on conll02 shared task data. | es-ner-location.bin | http://opennlp.sourceforge.net/models-1.5/es-ner-location.bin |
es | Name Finder | Misc name finder model. Trained on conll02 shared task data. | es-ner-misc.bin | http://opennlp.sourceforge.net/models-1.5/es-ner-misc.bin |
nl | Tokenizer | Trained on conllx alpino data. | nl-token.bin | http://opennlp.sourceforge.net/models-1.5/nl-token.bin |
nl | Sentence Detector | Trained on conllx alpino data. | nl-sent.bin | http://opennlp.sourceforge.net/models-1.5/nl-sent.bin |
nl | Name Finder | Person name finder model. Trained on conll02 shared task data. | nl-ner-person.bin | http://opennlp.sourceforge.net/models-1.5/nl-ner-person.bin |
nl | Name Finder | Organization name finder model. Trained on conll02 shared task data. | nl-ner-organization.bin | http://opennlp.sourceforge.net/models-1.5/nl-ner-organization.bin |
nl | Name Finder | Location name finder model. Trained on conll02 shared task data. | nl-ner-location.bin | http://opennlp.sourceforge.net/models-1.5/nl-ner-location.bin |
nl | Name Finder | Misc name finder model. Trained on conll02 shared task data. | nl-ner-misc.bin | http://opennlp.sourceforge.net/models-1.5/nl-ner-misc.bin |
nl | POS Tagger | Maxent model trained on conllx alpino data. | nl-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/nl-pos-maxent.bin |
nl | POS Tagger | Perceptron model trained on conllx alpino data. | nl-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/nl-pos-perceptron.bin |
pt | Tokenizer | Trained on conllx bosque data. | pt-token.bin | http://opennlp.sourceforge.net/models-1.5/pt-token.bin |
pt | Sentence Detector | Trained on conllx bosque data. | pt-sent.bin | http://opennlp.sourceforge.net/models-1.5/pt-sent.bin |
pt | POS Tagger | Maxent model trained on conllx bosque data. | pt-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/pt-pos-maxent.bin |
pt | POS Tagger | Perceptron model trained on conllx bosque data. | pt-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/pt-pos-perceptron.bin |
se | Tokenizer | Trained on conllx talbanken05 data. | se-token.bin | http://opennlp.sourceforge.net/models-1.5/se-token.bin |
se | Sentence Detector | Trained on conllx talbanken05 data. | se-sent.bin | http://opennlp.sourceforge.net/models-1.5/se-sent.bin |
se | POS Tagger | Maxent model trained on conllx talbanken05 data. | se-pos-maxent.bin | http://opennlp.sourceforge.net/models-1.5/se-pos-maxent.bin |
se | POS Tagger | Perceptron model trained on conllx talbanken05 data. | se-pos-perceptron.bin | http://opennlp.sourceforge.net/models-1.5/se-pos-perceptron.bin |
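If you just need one of these files, the links above can also be fetched directly with base R; a minimal sketch, downloading the English tokeniser model into the working directory (the destination file name is illustrative):

```r
# fetch the English tokeniser model listed in the table above
url <- "http://opennlp.sourceforge.net/models-1.5/en-token.bin"
download.file(url, destfile = "en-token.bin", mode = "wb")
```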
A basic tagger is currently included to easily tag training data for training a token name finder (`tnf_train`):
```r
# Manually tagged
manual <- paste("This organisation is called the <START:wef> World Economic Forum <END>",
                "It is often referred to as <START:wef> Davos <END> or the <START:wef> WEF <END> .")
# Create untagged string
data <- paste("This organisation is called the World Economic Forum",
              "It is often referred to as Davos or the WEF.")
# tag string
auto <- tag_docs(data, "WEF", "wef")
auto <- tag_docs(auto, "World Economic Forum", "wef")
auto <- tag_docs(auto, "Davos", "wef")
identical(manual, auto)
```
You will need considerable training data for name extraction: around 15,000 sentences. This does not mean 15,000 tagged sentences; it means 15,000 sentences representative of the documents you will have to extract names from.
Including sentences that do not contain tagged names reduces false positives; the model learns what to extract as much as it learns what not to extract.
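The tagged strings can then be written to a plain-text training file for `tnf_train`; the underlying OpenNLP name finder trainer expects one tagged sentence per line, and untagged sentences are simply left as they are. A minimal sketch, reusing `auto` from above (the extra sentence and file name are illustrative):

```r
# a sentence with no entity mention, kept untagged on purpose
untagged <- "Snow fell on the mountains overnight."

# each element becomes one line of the training file; in a real file each
# line should hold exactly one sentence
writeLines(c(auto, untagged), "wef-train.txt")
```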
In order to train a decent document classifier you are going to need around 5,000 classified documents as training data, with a bare minimum of 5 documents per category.
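In practice you would assemble such a training set from files on disk rather than typing it in; a minimal sketch in base R, assuming one sub-directory per class under an illustrative `training/` folder, each containing one plain-text file per document:

```r
# one sub-directory per class, e.g. training/Sport, training/Business, ...
classes <- list.dirs("training", recursive = FALSE)

# read every file and label it with the name of its directory
data <- do.call(rbind, lapply(classes, function(dir) {
  files <- list.files(dir, full.names = TRUE)
  data.frame(
    class = basename(dir),
    doc = vapply(files, function(f) paste(readLines(f), collapse = " "), character(1)),
    stringsAsFactors = FALSE
  )
}))
```

The toy example below uses a tiny hand-written data frame instead, simply to show the mechanics.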
```r
library(decipher)
# get working directory
# need to pass full path
wd <- getwd()
data <- data.frame(
  class = c("Sport", "Business", "Sport", "Sport", "Business", "Politics", "Politics", "Politics"),
  doc = c("Football, tennis, golf and, bowling and, score.",
          "Marketing, Finance, Legal and, Administration.",
          "Tennis, Ski, Golf and, gym and, match.",
          "football, climbing and gym.",
          "Marketing, Business, Money and, Management.",
          "This document talks politics and Donal Trump.",
          "Donald Trump is the President of the US, sadly.",
          "Article about politics and president Trump.")
)
# Not enough data: most events get dropped during indexing
model <- dc_train(model = paste0(wd, "/model.bin"), data = data, lang = "en")
#> Indexing events with TwoPass using cutoff of 5
#>
#> Computing event counts... done. 8 events
#> Indexing... Dropped event Sport:[bow=football,, bow=climbing, bow=and, bow=gym.]
#> Dropped event Politics:[bow=This, bow=document, bow=talks, bow=politics, bow=and, bow=Donal, bow=Trump.]
#> Dropped event Politics:[bow=Donald, bow=Trump, bow=is, bow=the, bow=President, bow=of, bow=the, bow=US,, bow=sadly.]
#> Dropped event Politics:[bow=Article, bow=about, bow=politics, bow=and, bow=president, bow=Trump.]
#> done.
#> Sorting and merging events... done. Reduced 4 events to 2.
#> Done indexing in 0.02 s.
#> Incorporating indexed data for training...
#> done.
#> Number of Event Tokens: 2
#> Number of Outcomes: 3
#> Number of Predicates: 1
#> ...done.
#> Computing model parameters ...
#> Performing 100 iterations.
#> 1: ... loglikelihood=-4.394449154672439 0.5
#> 2: ... loglikelihood=-3.8421887157189216 0.5
#> 3: ... loglikelihood=-3.6154430125057786 0.5
#> 4: ... loglikelihood=-3.465596155326559 0.5
#> 5: ... loglikelihood=-3.357105109210417 0.5
#> 6: ... loglikelihood=-3.274527687900086 0.5
#> 7: ... loglikelihood=-3.209347253794932 0.5
#> 8: ... loglikelihood=-3.1564421346077856 0.5
#> 9: ... loglikelihood=-3.1125409493265437 0.5
#> 10: ... loglikelihood=-3.075452811722954 0.5
#> 11: ... loglikelihood=-3.043653058837632 0.5
#> 12: ... loglikelihood=-3.0160464644888547 0.5
#> 13: ... loglikelihood=-2.991825031705967 0.5
#> 14: ... loglikelihood=-2.970378973443104 0.5
#> 15: ... loglikelihood=-2.951238991461673 0.5
#> 16: ... loglikelihood=-2.934037698521228 0.5
#> 17: ... loglikelihood=-2.918483147064785 0.5
#> 18: ... loglikelihood=-2.904340240566901 0.5
#> 19: ... loglikelihood=-2.8914174106886295 0.5
#> 20: ... loglikelihood=-2.879556893058896 0.5
#> 21: ... loglikelihood=-2.868627512822131 0.5
#> 22: ... loglikelihood=-2.858519252812046 0.5
#> 23: ... loglikelihood=-2.8491391089469005 0.5
#> 24: ... loglikelihood=-2.8404078891499767 0.5
#> 25: ... loglikelihood=-2.8322577133850033 0.5
#> 26: ... loglikelihood=-2.8246300412382315 0.5
#> 27: ... loglikelihood=-2.8174741010412983 0.5
#> 28: ... loglikelihood=-2.8107456278868233 0.5
#> 29: ... loglikelihood=-2.8044058416105955 0.5
#> 30: ... loglikelihood=-2.798420612901019 0.5
#> 31: ... loglikelihood=-2.7927597781512117 0.5
#> 32: ... loglikelihood=-2.787396572848336 0.5
#> 33: ... loglikelihood=-2.7823071601298786 0.5
#> 34: ... loglikelihood=-2.777470236275584 0.5
#> 35: ... loglikelihood=-2.7728666988024067 0.5
#> 36: ... loglikelihood=-2.7684793658127664 0.5
#> 37: ... loglikelihood=-2.7642927375468878 0.5
#> 38: ... loglikelihood=-2.760292792877533 0.5
#> 39: ... loglikelihood=-2.7564668148843223 0.5
#> 40: ... loglikelihood=-2.7528032407468404 0.5
#> 41: ... loglikelihood=-2.74929153206948 0.5
#> 42: ... loglikelihood=-2.7459220624478426 0.5
#> 43: ... loglikelihood=-2.742686019645536 0.5
#> 44: ... loglikelihood=-2.7395753202010997 0.5
#> 45: ... loglikelihood=-2.7365825346502968 0.5
#> 46: ... loglikelihood=-2.7337008218468335 0.5
#> 47: ... loglikelihood=-2.730923871108338 0.5
#> 48: ... loglikelihood=-2.72824585111486 0.5
#> 49: ... loglikelihood=-2.7256613646526846 0.5
#> 50: ... loglikelihood=-2.723165408433481 0.5
#> 51: ... loglikelihood=-2.720753337333093 0.5
#> 52: ... loglikelihood=-2.7184208324896924 0.5
#> 53: ... loglikelihood=-2.7161638727811193 0.5
#> 54: ... loglikelihood=-2.7139787092686083 0.5
#> 55: ... loglikelihood=-2.711861842250955 0.5
#> 56: ... loglikelihood=-2.709810000621419 0.5
#> 57: ... loglikelihood=-2.7078201232605617 0.5
#> 58: ... loglikelihood=-2.7058893422331587 0.5
#> 59: ... loglikelihood=-2.7040149675871263 0.5
#> 60: ... loglikelihood=-2.702194473577981 0.5
#> 61: ... loglikelihood=-2.7004254861643284 0.5
#> 62: ... loglikelihood=-2.6987057716388008 0.5
#> 63: ... loglikelihood=-2.6970332262752077 0.5
#> 64: ... loglikelihood=-2.695405866886859 0.5
#> 65: ... loglikelihood=-2.6938218222032515 0.5
#> 66: ... loglikelihood=-2.692279324983063 0.5
#> 67: ... loglikelihood=-2.6907767047906637 0.5
#> 68: ... loglikelihood=-2.68931238137153 0.5
#> 69: ... loglikelihood=-2.6878848585690607 0.5
#> 70: ... loglikelihood=-2.6864927187315444 0.5
#> 71: ... loglikelihood=-2.6851346175635373 0.5
#> 72: ... loglikelihood=-2.6838092793807213 0.5
#> 73: ... loglikelihood=-2.682515492731618 0.5
#> 74: ... loglikelihood=-2.681252106353268 0.5
#> 75: ... loglikelihood=-2.6800180254313566 0.5
#> 76: ... loglikelihood=-2.6788122081382024 0.5
#> 77: ... loglikelihood=-2.6776336624246775 0.5
#> 78: ... loglikelihood=-2.6764814430444503 0.5
#> 79: ... loglikelihood=-2.6753546487910365 0.5
#> 80: ... loglikelihood=-2.6742524199299984 0.5
#> 81: ... loglikelihood=-2.673173935810302 0.5
#> 82: ... loglikelihood=-2.672118412640326 0.5
#> 83: ... loglikelihood=-2.6710851014153434 0.5
#> 84: ... loglikelihood=-2.670073285984502 0.5
#> 85: ... loglikelihood=-2.669082281246406 0.5
#> 86: ... loglikelihood=-2.6681114314633576 0.5
#> 87: ... loglikelihood=-2.6671601086852044 0.5
#> 88: ... loglikelihood=-2.666227711274505 0.5
#> 89: ... loglikelihood=-2.6653136625254596 0.5
#> 90: ... loglikelihood=-2.6644174093696673 0.5
#> 91: ... loglikelihood=-2.663538421162381 0.5
#> 92: ... loglikelihood=-2.6626761885434305 0.5
#> 93: ... loglikelihood=-2.6618302223674837 0.5
#> 94: ... loglikelihood=-2.661000052698734 0.5
#> 95: ... loglikelihood=-2.6601852278655103 0.5
#> 96: ... loglikelihood=-2.6593853135706516 0.5
#> 97: ... loglikelihood=-2.6585998920538243 0.5
#> 98: ... loglikelihood=-2.657828561302261 0.5
#> 99: ... loglikelihood=-2.657070934306653 0.5
#> 100: ... loglikelihood=-2.656326638359206 0.5
```
```r
Sys.sleep(15)
# repeat data 50 times
# Obviously do not do that in the real world
data <- do.call("rbind", replicate(50, data[sample(nrow(data), 4), ],
                                   simplify = FALSE))
# train model
model <- dc_train(model = paste0(wd, "/model.bin"), data = data, lang = "en")
#> Indexing events with TwoPass using cutoff of 5
#>
#> Computing event counts... done. 200 events
#> Indexing... done.
#> Sorting and merging events... done. Reduced 200 events to 8.
#> Done indexing in 0.04 s.
#> Incorporating indexed data for training...
#> done.
#> Number of Event Tokens: 8
#> Number of Outcomes: 3
#> Number of Predicates: 39
#> ...done.
#> Computing model parameters ...
#> Performing 100 iterations.
#> 1: ... loglikelihood=-219.72245773362195 0.335
#> 2: ... loglikelihood=-142.88196406072186 1.0
#> 3: ... loglikelihood=-106.34647714324419 1.0
#> 4: ... loglikelihood=-84.4755805526073 1.0
#> 5: ... loglikelihood=-69.8854447064422 1.0
#> 6: ... loglikelihood=-59.478523742925404 1.0
#> 7: ... loglikelihood=-51.69879250162104 1.0
#> 8: ... loglikelihood=-45.67441939631744 1.0
#> 9: ... loglikelihood=-40.87873008584082 1.0
#> 10: ... loglikelihood=-36.9751384439237 1.0
#> 11: ... loglikelihood=-33.73879993427859 1.0
#> 12: ... loglikelihood=-31.01400451535654 1.0
#> 13: ... loglikelihood=-28.68963344487269 1.0
#> 14: ... loglikelihood=-26.684345273926482 1.0
#> 15: ... loglikelihood=-24.937278865062638 1.0
#> 16: ... loglikelihood=-23.402022062667086 1.0
#> 17: ... loglikelihood=-22.04258525730783 1.0
#> 18: ... loglikelihood=-20.830645280266495 1.0
#> 19: ... loglikelihood=-19.743616728368874 1.0
#> 20: ... loglikelihood=-18.7632755449706 1.0
#> 21: ... loglikelihood=-17.8747592947954 1.0
#> 22: ... loglikelihood=-17.065829440884418 1.0
#> 23: ... loglikelihood=-16.326319087680915 1.0
#> 24: ... loglikelihood=-15.647714125489163 1.0
#> 25: ... loglikelihood=-15.02283173465646 1.0
#> 26: ... loglikelihood=-14.445570898905896 1.0
#> 27: ... loglikelihood=-13.91071683455445 1.0
#> 28: ... loglikelihood=-13.413786247233787 1.0
#> 29: ... loglikelihood=-12.950903829891859 1.0
#> 30: ... loglikelihood=-12.518702899695455 1.0
#> 31: ... loglikelihood=-12.114244855196077 1.0
#> 32: ... loglikelihood=-11.734953431048229 1.0
#> 33: ... loglikelihood=-11.378560679331837 1.0
#> 34: ... loglikelihood=-11.043062312631536 1.0
#> 35: ... loglikelihood=-10.726680572862678 1.0
#> 36: ... loglikelihood=-10.427833189441525 1.0
#> 37: ... loglikelihood=-10.145107294894169 1.0
#> 38: ... loglikelihood=-9.877237399853936 1.0
#> 39: ... loglikelihood=-9.623086710342822 1.0
#> 40: ... loglikelihood=-9.381631211225155 1.0
#> 41: ... loglikelihood=-9.151946050317257 1.0
#> 42: ... loglikelihood=-8.933193844939263 1.0
#> 43: ... loglikelihood=-8.724614602022037 1.0
#> 44: ... loglikelihood=-8.525516998252673 1.0
#> 45: ... loglikelihood=-8.335270811202298 1.0
#> 46: ... loglikelihood=-8.153300328268987 1.0
#> 47: ... loglikelihood=-7.979078589378311 1.0
#> 48: ... loglikelihood=-7.812122343109141 1.0
#> 49: ... loglikelihood=-7.651987615334983 1.0
#> 50: ... loglikelihood=-7.498265805440456 1.0
#> 51: ... loglikelihood=-7.3505802383570025 1.0
#> 52: ... loglikelihood=-7.208583111589921 1.0
#> 53: ... loglikelihood=-7.071952785502366 1.0
#> 54: ... loglikelihood=-6.940391372714274 1.0
#> 55: ... loglikelihood=-6.813622588838412 1.0
#> 56: ... loglikelihood=-6.691389832125446 1.0
#> 57: ... loglikelihood=-6.573454464104223 1.0
#> 58: ... loglikelihood=-6.459594267122684 1.0
#> 59: ... loglikelihood=-6.349602057937263 1.0
#> 60: ... loglikelihood=-6.243284439257992 1.0
#> 61: ... loglikelihood=-6.140460673512784 1.0
#> 62: ... loglikelihood=-6.040961665111191 1.0
#> 63: ... loglikelihood=-5.944629039218141 1.0
#> 64: ... loglikelihood=-5.851314306537801 1.0
#> 65: ... loglikelihood=-5.760878104891887 1.0
#> 66: ... loglikelihood=-5.673189509487204 1.0
#> 67: ... loglikelihood=-5.588125404729695 1.0
#> 68: ... loglikelihood=-5.505569911277434 1.0
#> 69: ... loglikelihood=-5.425413862753372 1.0
#> 70: ... loglikelihood=-5.347554327171909 1.0
#> 71: ... loglikelihood=-5.2718941686889345 1.0
#> 72: ... loglikelihood=-5.19834164576965 1.0
#> 73: ... loglikelihood=-5.1268100422952125 1.0
#> 74: ... loglikelihood=-5.057217328503126 1.0
#> 75: ... loglikelihood=-4.989485848986482 1.0
#> 76: ... loglikelihood=-4.923542035267944 1.0
#> 77: ... loglikelihood=-4.859316140721235 1.0
#> 78: ... loglikelihood=-4.796741995840829 1.0
#> 79: ... loglikelihood=-4.73575678206189 1.0
#> 80: ... loglikelihood=-4.676300822511777 1.0
#> 81: ... loglikelihood=-4.618317388233948 1.0
#> 82: ... loglikelihood=-4.561752518566649 1.0
#> 83: ... loglikelihood=-4.506554854485603 1.0
#> 84: ... loglikelihood=-4.452675483832905 1.0
#> 85: ... loglikelihood=-4.4000677974554 1.0
#> 86: ... loglikelihood=-4.348687355366514 1.0
#> 87: ... loglikelihood=-4.298491762126714 1.0
#> 88: ... loglikelihood=-4.24944055071064 1.0
#> 89: ... loglikelihood=-4.201495074195 1.0
#> 90: ... loglikelihood=-4.154618404659525 1.0
#> 91: ... loglikelihood=-4.108775238747462 1.0
#> 92: ... loglikelihood=-4.063931809379816 1.0
#> 93: ... loglikelihood=-4.0200558031604725 1.0
#> 94: ... loglikelihood=-3.977116283049374 1.0
#> 95: ... loglikelihood=-3.9350836159160716 1.0
#> 96: ... loglikelihood=-3.893929404618151 1.0
#> 97: ... loglikelihood=-3.853626424278687 1.0
#> 98: ... loglikelihood=-3.8141485624627203 1.0
#> 99: ... loglikelihood=-3.7754707629779007 1.0
#> 100: ... loglikelihood=-3.737568973045476 1.0
```
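Once trained, the model can be used to classify new documents. The exact classification function may differ across versions of the package; the snippet below uses `dc_classify()` purely as a hypothetical placeholder, so check the package reference for the real name and arguments:

```r
new_docs <- c(
  "The final score of the tennis match surprised everyone.",
  "The board discussed marketing and finance strategy."
)

# dc_classify() is a placeholder name used for illustration only;
# substitute the package's actual classification function
results <- dc_classify(new_docs, model = paste0(wd, "/model.bin"))
results
```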