Just a way to convert `agiga` data into a `processors` `Document`.
Why spend the time and resources parsing and annotating over 183 million sentences when it has already been done?

    import org.clulab.agiga

    // build a processors.Document
    val doc = agiga.toDocument("path/to/agiga/xml/ltw_eng_200705.xml.gz")

Everything is configured in the `application.conf` file.
- Change the `view` property to "lemmas"
- Change the `inputDir` property to wherever your copy of `agiga` is nestled on your disk
- Change the `outputDir` property to wherever you want your compressed copy of the lemmatized English Gigaword to be written
- (Optional) Change the `nthreads` property to the maximum number of threads you prefer to use for parallelization
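
For reference, the four properties might look like this in `application.conf`. This is only a sketch: the property names come from the list above, but the flat top-level layout and every value shown are placeholder assumptions, so check the file shipped with the project for its actual structure.

```hocon
# application.conf (HOCON); layout and values are illustrative assumptions
view = "lemmas"                  # which representation to write out
inputDir = "/path/to/agiga/xml"  # placeholder: location of your agiga copy
outputDir = "/path/to/output"    # placeholder: where compressed output is written
nthreads = 4                     # optional; placeholder thread count
```
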
All that's left is to run `AgigaReader`:

    sbt "runMain sem.AgigaReader"

The `view` property accepts the following values:

| Value | Description |
|---|---|
| `"words"` | word form of each token |
| `"lemmas"` | lemma form of each token |
| `"tags"` | PoS tag of each token |
| `"entities"` | NE label of each token |
| `"deps"` | `<word form of head>_<relation>_<word form of dependent>` |
| `"lemma-deps"` | `<lemmatized head>_<relation>_<lemmatized dependent>` |
| `"tag-deps"` | `<PoS tag of head>_<relation>_<PoS tag of dependent>` |
| `"entity-deps"` | `<NE label of head>_<relation>_<NE label of dependent>` |
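
To make the formats in the table concrete, here is a minimal, self-contained sketch of how each view could be rendered. `Token`, `Edge`, and `ViewSketch` are hypothetical stand-ins invented for illustration; they are not the project's or `processors`' actual types, and the real output (separators between items, casing of relations) may differ.

```scala
// Hypothetical stand-ins for illustration; not the project's actual types.
final case class Token(word: String, lemma: String, tag: String, entity: String)
// head and dependent are indices into the token sequence.
final case class Edge(head: Int, relation: String, dependent: Int)

object ViewSketch {
  // Pick the per-token attribute that a given view refers to.
  private def field(view: String): Token => String = view match {
    case "words" | "deps"           => _.word
    case "lemmas" | "lemma-deps"    => _.lemma
    case "tags" | "tag-deps"        => _.tag
    case "entities" | "entity-deps" => _.entity
    case other => throw new IllegalArgumentException(s"unknown view: $other")
  }

  // Token views yield one string per token;
  // dependency views yield one head_relation_dependent string per edge.
  def render(view: String, tokens: Vector[Token], edges: Vector[Edge]): Vector[String] = {
    val f = field(view)
    if (view.endsWith("deps"))
      edges.map(e => s"${f(tokens(e.head))}_${e.relation}_${f(tokens(e.dependent))}")
    else
      tokens.map(f)
  }
}
```

For the two-token sentence "Dogs bark" with a single `nsubj` edge from "bark" to "Dogs", `ViewSketch.render("lemma-deps", tokens, edges)` would give `Vector("bark_nsubj_dog")`, while the `"lemmas"` view would give `Vector("dog", "bark")`.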
- Add output options for dependencies using the DFS ordering described in "Higher-order Lexical Semantic Models for Non-factoid Answer Reranking"