...

Choose a mirror near you and then go to "Database backup dumps". The file to download is
named like this (the date will differ): enwikinews-20120727-pages-articles.xml.bz2
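The dump page usually publishes a checksums file next to the archive (the exact filename below is an assumption based on the common dump layout), so the download can be verified before decompressing. A minimal sketch, shown on stand-in files so it runs anywhere:

```shell
# For the real dump the check would look like (assumed checksums filename):
#   md5sum -c <(grep pages-articles.xml.bz2 enwikinews-20120727-md5sums.txt)
# Self-contained demo of the same md5sum -c workflow:
printf 'dump content\n' > dump.bz2      # stand-in for the downloaded archive
md5sum dump.bz2 > md5sums.txt           # stand-in for the published checksums
md5sum -c md5sums.txt                   # prints "dump.bz2: OK" on success
```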

After the download, decompress the file, e.g. with bunzip2 on Linux.
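A minimal sketch of the decompression step, demonstrated on a stand-in file so it runs anywhere; for the real dump, simply run bunzip2 on the downloaded archive:

```shell
# bunzip2 replaces the .bz2 file with the decompressed one; -k keeps the
# compressed original around as well.
printf '<mediawiki/>\n' > sample.xml    # stand-in for the dump XML
bzip2 sample.xml                        # produces sample.xml.bz2
bunzip2 -k sample.xml.bz2               # restores sample.xml, keeps the .bz2
```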

The current version of the parser only works well for the English Wikinews dump. Contributions to add support for other
languages are very welcome.

...

And compile it with this command: mvn clean install

The XML file can now be parsed:
bin/converter /home/blue/Downloads/enwikinews-20120727-pages-articles.xml articles

This command will take a while to run; when it's done there is one XMI file for each
article in the articles folder.
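A quick way to sanity-check the converter output is to count the generated files (the articles folder name comes from the command above; a stand-in file is created here so the snippet is self-contained):

```shell
# Each article becomes one .xmi file in the converter's output folder.
mkdir -p articles
touch articles/demo.xmi                 # stand-in for a converted article
find articles -name '*.xmi' | wc -l     # number of converted articles
```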

To load the articles into the corpus server, a corpus must be created first.
This is done with the corpus-server-tools.

Check out the tools:
svn co https://svn.apache.org/repos/asf/opennlp/sandbox/corpus-server-tools
and build them with mvn clean install.

Now the wikinews corpus can be created in the previously started Corpus Server:
bin/cs-tools CreateCorpus http://localhost:8080/rest/corpora enwikinews ../wikinews-importer/samples/TypeSystem.xml ../wikinews-importer/samples/wikinews.xml

And import the article files:
bin/cs-tools CASImporter http://localhost:8080/rest/corpora/enwikinews ../wikinews-importer/articles

Opening an article in the Cas Editor

...