PROJECT PROPOSAL

Objectives of the project

The main objective of the system is to categorize word based on it use such as person, location, organization dates, quantities, numeric expressions, email and title (profession)

Problem statement

Named entity recognition is an important part of various natural language applications. It is employed in a preprocessing stage which extracts proper nouns required by many natural language applications to improve their performance. If we can extract needed information fast and efficiently. we can save a lot of time and resources.

 

Scope for the project

The scope of the project is to develop a java based system which can categorize a given set of text to a predefined category such as person, location, time, number, title and email.

  Overview of Existing Systems

Notable NER platforms include:

  • GATE
  • OpenNLP
  • SpaCy

These softwares are sophisticated and needs high expertise for using them. A non IT person can’t us them without training. Most of them lack user interface and most instructions are given in command line interface.

Target Work Force

This system to be developed can be used by wide range of audiences. From company recruiting employees to non IT worked to exact information from biodatas, transcripts, research papers, and all sorts of text documents. There is no need to be an expert in IT to use our system.

Methodology

Automatic Text Classification involves assigning a text document to a set of pre-defined classes automatically, using a machine learning technique. The classification is usually done on the basis of significant words or features extracted from the text document.

The system involves five phases (Fig. 1.)

I. Training set of text documents

ii. Pre-processing

All the stop words and the code which is repeatedly coming are also removed. After this process there will be only plain text, the plain text has many English words, verbs, adjectives which are not of use. This Plain Text is processed and English Keywords are Removed using Database stored Keywords.

iii. Extract features

iv. Apply machine learning technique remaining text

The classifier will classify the featured word and will make a class in database with its information.

v. Recognition

Database

We will be using the data set of the stanfordcoreNLP developed by the university of stanford USA. We use this dataset because it contains large amount of data and it is open sourced and free to use Results

The output will be categorized into relevant component. The user can select the category and it show the output.

Technologies and tools

For this project we will be using the java libraries of stanford core nlp project. we are using dataset from same library too. we are going to develop the system using NetBeans IDE.

The versions are given below

Future Modifications

We have an idea to develop a new layout that will enable us to add a text file, read it and categorize it. Further updating we can download the categorized as a spreadsheet.

References

  • Incorporating Non-local Information into Information Extraction Systems by Gibbs SamplingJenny Rose Finkel, Trond Grenager, and Christopher Manning Computer Science Department
  • https://en.wikipedia.org/wiki/Named-entity_recognition
  • A survey of named entity recognition and classification

David Nadeau, Satoshi Sekine

National Research Council Canada / New York University

  • A Survey on Deep Learning for Named Entity Recognition

Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li