...
Loading the Wikinews Corpus
Get the Wikinews dumps
The Wikinews dumps can be downloaded from Wikimedia. General information about the dumps is available here:
http://meta.wikimedia.org/wiki/Data_dumps
...
After the download finishes, decompress the file, e.g. with bunzip2 on Linux.
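The decompression step can be sketched as a small shell helper. This is only an illustration: the function name decompress_dump is made up for this page, and the archive name in the example call is an assumption (the dump file name used further down plus a .bz2 suffix), so adjust it to the file you actually downloaded.

```shell
# decompress_dump ARCHIVE
# Hypothetical helper: unpacks a .bz2 dump next to the archive.
decompress_dump() {
    # -k keeps the compressed archive; omit it to save disk space.
    bunzip2 -k "$1"
}

# Example call (assumed file name, replace with your download):
# decompress_dump enwikinews-20120727-pages-articles.xml.bz2
```

With -k the original archive stays in place, which is convenient if the conversion has to be repeated later.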
Convert the dump files to Apache UIMA XMI files
The current version of the parser only works well for the English Wikinews dump. Contributions that make it work for other
languages are very welcome.
Get and compile the Wikinews Importer
Check out the Wikinews parser:
svn co https://svn.apache.org/repos/asf/opennlp/sandbox/wikinews-importer/
Then compile it with this command: mvn clean install
Parse the XML articles
The XML file can now be parsed:
bin/converter /home/blue/Downloads/enwikinews-20120727-pages-articles.xml articles
This command takes a while to run. When it is done, there is one XMI file for each
article in the articles folder.
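A quick way to check that the conversion produced output is to count the generated files. The sketch below assumes the files carry the .xmi extension and sit directly in the output folder; the helper name count_xmi is hypothetical and not part of the importer.

```shell
# count_xmi DIR
# Hypothetical helper: prints the number of .xmi files directly inside DIR.
count_xmi() {
    # tr strips the padding that some wc implementations add.
    find "$1" -maxdepth 1 -type f -name '*.xmi' | wc -l | tr -d ' '
}

# Example call for the output folder used above:
# count_xmi articles
```

If the count is zero, the converter most likely failed or wrote its output somewhere else.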
Load the articles to the Corpus Server
To load the articles into the Corpus Server, a corpus must be created first.
This is done with the corpus-server-tools.
Get and compile the Corpus Server Tools
Check out the tools:
svn co https://svn.apache.org/repos/asf/opennlp/sandbox/corpus-server-tools
Then build them with mvn clean install.
Create a new Corpus
Now the Wikinews corpus can be created in the previously started Corpus Server:
bin/cs-tools CreateCorpus http://localhost:8080/rest/corpora enwikinews ../wikinews-importer/samples/TypeSystem.xml ../wikinews-importer/samples/wikinews.xml
The response code should be 204 (No Content). If something goes wrong, you get an HTTP error code instead.
Import the articles to the created corpus
Then import the article files:
bin/cs-tools CASImporter http://localhost:8080/rest/corpora/enwikinews ../wikinews-importer/articles
...