orlando21/nlp_project

Java NLP Project Using Apache OpenNLP

This repository contains Java demonstration code that uses Apache OpenNLP 1.5.3. I wrote this code as a first experience with Natural Language Processing (NLP). There are many ways to implement NLP, but the OpenNLP library is a good starting point for beginners.

This repository was uploaded with all the necessary Java, OpenNLP, and Maven libraries and should be more or less ready to run. Refer to the Apache OpenNLP documentation for information on requirements and versions.

Goals

The goal was to read a text file, input.txt, and write the NLP analysis to output.txt. The input file is kept small because running these classes on longer text can take a long time. The appropriate OpenNLP English-language model is used for each analysis step. The code does the following:

  • Reads the sentences of input.txt
  • Writes the sentences as a text string to the console
  • Writes the number of sentences found to both the console and output.txt
  • Writes the number of tokens (words, punctuation, numbers, etc.) found to both the console and output.txt
  • Writes the proper names of individuals found to the console
  • Writes POS tags to output.txt
  • Writes chunks to output.txt
  • Writes parse results to output.txt

This is actually a small subset of what one can do with these classes. For example, you can parse sentences for grammatical structure and much more.
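
To make the read, count, and write flow concrete, here is a minimal sketch that uses only the JDK. It is not this repository's code and it does not use OpenNLP: the class name TextStatsSketch and its methods are hypothetical, and java.text.BreakIterator stands in for the OpenNLP sentence detector and tokenizer, so the counts on real text may differ from what the trained models report.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not a class from this repository.
// BreakIterator is a rule-based stand-in for the OpenNLP models.
final class TextStatsSketch {

    // Counts the non-blank segments that the iterator finds in the text.
    private static int countSegments(BreakIterator it, String text) {
        it.setText(text);
        int count = 0;
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (!text.substring(start, end).trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Number of sentences according to the JDK's English sentence rules.
    static int countSentences(String text) {
        return countSegments(BreakIterator.getSentenceInstance(Locale.ENGLISH), text);
    }

    // Number of tokens (words, numbers, punctuation marks), ignoring whitespace.
    static int countTokens(String text) {
        return countSegments(BreakIterator.getWordInstance(Locale.ENGLISH), text);
    }

    // Reads the input file, prints both counts, and writes them to the output file.
    static void writeReport(Path input, Path output) throws IOException {
        String text = new String(Files.readAllBytes(input), StandardCharsets.UTF_8);
        List<String> lines = Arrays.asList(
                "Sentences: " + countSentences(text),
                "Tokens: " + countTokens(text));
        for (String line : lines) {
            System.out.println(line);
        }
        Files.write(output, lines, StandardCharsets.UTF_8);
    }
}
```

Calling TextStatsSketch.writeReport(Paths.get("input.txt"), Paths.get("output.txt")) from a main method would mirror the console-plus-file behavior described above, minus the name finding, tagging, chunking, and parsing steps.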

Features

This code contains the following features:

  • Sentence detector
  • Sentence tokenizer
  • Name finder to detect named entities
  • Part-of-speech tagger
  • Chunker
  • Parser

As mentioned above, the appropriate OpenNLP English-language model is used to do these tasks.

Structure

The main() method is implemented in OpenNLP_App.java, which executes the following classes:

  • ReadInput.java for reading the contents of input.txt
  • SentenceDetect.java for detecting sentence boundaries
  • SentenceTokenize.java for detecting words and punctuation
  • NamedEntityRecognition.java for finding names in sentences
  • TaggerPOS.java for assigning English grammar categories to detected words
  • SentenceChunk.java for organizing sentences into chunks, based on detected tokens
  • SentenceParse.java for iteratively parsing a sentence according to parts of speech
  • PrintOutput.java for writing results to output.txt
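
One way to picture this orchestration is a driver that runs named stages in a fixed order and collects one report line per stage. The sketch below is an assumption about the general shape only: PipelineSketch, addStage, and run are hypothetical names, each stage is modeled as a plain function from text to a summary string, and the real classes in this repository may pass tokens or tags from one step to the next instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical orchestration sketch. The stage names can mirror the classes
// listed above, but these signatures are assumptions, not the repository's API.
final class PipelineSketch {

    // LinkedHashMap keeps the stages in the order they were registered.
    private final Map<String, Function<String, String>> stages = new LinkedHashMap<>();

    // Registers a stage under a name and returns this object for chaining.
    PipelineSketch addStage(String name, Function<String, String> stage) {
        stages.put(name, stage);
        return this;
    }

    // Applies every stage to the same input text and returns one line per stage.
    List<String> run(String text) {
        List<String> report = new ArrayList<>();
        for (Map.Entry<String, Function<String, String>> entry : stages.entrySet()) {
            report.add(entry.getKey() + ": " + entry.getValue().apply(text));
        }
        return report;
    }
}
```

A main method could register one stage per analysis class, call run on the contents of input.txt, and hand the resulting lines to whatever writes output.txt. Keeping the stages in an ordered map makes the execution order explicit and easy to change.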

Other Features

This project uses Maven for dependency management. See the OpenNLP documentation for more information.
