
Information extraction for geographical texts in Hindi languages

Kamlesh Dutta

Ankit Jain, Amit Kumar and R.K.Dutta

National Institute of Technology


Extended abstract
In this paper we present a statistical approach to knowledge extraction from spatial information represented in Hindi languages. The location data is derived from the various available corpora and combined to obtain the best performance for domain-independent named-entity recognition in text. We present some experiments illustrating the accuracy of the method.

We present a statistical method for knowledge extraction from spatial information. The approach relies only on the corpus available to us and does not require any hand-crafting. The first section describes the algorithm used, the probabilistic model behind it, its implementation, the evaluation, and the results obtained.


  1. Read the source text and some ordinary text.
  2. Perform frequency analysis.
  3. Filter out words having a frequency
    • Less than a minimum threshold value
    • Greater than a maximum threshold value
  4. Obtain keywords.
  5. Apply the graph-display algorithm.
  6. Display the graph.
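
Steps 2 to 4 above can be illustrated with a minimal Python sketch. It is not the authors' VC++ implementation: the function name extract_keywords, the whitespace tokenisation, the romanised sample sentences, and the threshold values are all assumptions chosen for illustration.

```python
from collections import Counter

def extract_keywords(text, min_count, max_count):
    """Return {word: frequency} for words whose count lies in [min_count, max_count]."""
    # Frequency analysis; splitting on whitespace is an assumption of this sketch.
    counts = Counter(text.split())
    # Drop words that are rarer than min_count or more frequent than max_count.
    return {w: c for w, c in counts.items() if min_count <= c <= max_count}

# Hypothetical romanised sample text, invented for this example.
sample = " ".join([
    "dilli bharat ki rajdhani hai",
    "dilli yamuna ke kinare hai",
    "agra yamuna ke kinare hai",
    "yamuna bharat ki nadi hai",
])

# Thresholds are illustrative; the source does not state the values it used.
keywords = extract_keywords(sample, min_count=2, max_count=3)
print(sorted(keywords))
```

In this toy input the very frequent word "hai" and the single-occurrence words are dropped, while short function words such as "ki" and "ke" still fall inside the band, which suggests why a separate list of commonly occurring words is also useful.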

The input to the system is a source file containing the data from which useful information is to be extracted. We also use a list of commonly occurring words. This list is itself prepared automatically by counting the frequency of occurrence of each word. In both cases we remove words whose frequency is above an upper limit or below a lower limit. This removes words that carry no significance, e.g. commonly occurring words such as (ka, kI, hO, AaOr), and words that have little bearing on the main content of the source, e.g. a specific detail such as a date, which may not have any significance. If, however, the date occurs repeatedly, we retain this information.
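
One possible reading of the automatically prepared common-word list is sketched below in Python. The source does not give the exact procedure, so the helper names common_words and remove_common, the reference text, and the cut-off of 2 are hypothetical.

```python
from collections import Counter

def common_words(reference_text, min_count):
    """Words that occur at least min_count times in ordinary reference text."""
    # Assumption: words that are frequent in ordinary text carry little content.
    counts = Counter(reference_text.split())
    return {w for w, c in counts.items() if c >= min_count}

def remove_common(candidates, stop_words):
    """Keep only the candidate keywords that are not in the common-word list."""
    return [w for w in candidates if w not in stop_words]

# Hypothetical ordinary text, unrelated to the geographical source.
reference = "ram ki kitab hai sita ke ghar hai mohan ki gaadi hai ravi ke paas hai"
stop_words = common_words(reference, min_count=2)

# Candidate keywords as they might come out of a frequency-band filter.
candidates = ["dilli", "bharat", "ki", "yamuna", "ke", "kinare"]
filtered = remove_common(candidates, stop_words)
print(filtered)
```

Because the list is derived from counts alone, no hand-written stop-word list is needed, which matches the stated goal of avoiding hand-crafting.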

The system is developed in VC++. Graphical output is generated in a window, for which we use MFC. As can be seen from the above test results, the system successfully extracted information in this case; however, when the size of the input was increased, a few spurious links arose. The vast majority of the links accurately depicted the relationships between entities. The system was able to extract the important information automatically, with no prior knowledge.
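
The source does not say how the links between keywords are computed. The Python sketch below assumes sentence-level co-occurrence, a common choice and not necessarily the authors' method; the function name cooccurrence_links and the sample sentences are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_links(sentences, keywords):
    """Count, for each keyword pair, the sentences in which both words appear."""
    links = Counter()
    for sentence in sentences:
        # Keywords present in this sentence, sorted so each pair has one fixed order.
        present = sorted(set(sentence.split()) & keywords)
        for a, b in combinations(present, 2):
            links[(a, b)] += 1
    return links

# Hypothetical romanised sentences and keyword set, invented for this example.
sentences = [
    "dilli bharat ki rajdhani hai",
    "dilli yamuna ke kinare hai",
    "agra yamuna ke kinare hai",
    "yamuna bharat ki nadi hai",
]
keywords = {"dilli", "bharat", "yamuna", "kinare"}

links = cooccurrence_links(sentences, keywords)
for pair, weight in sorted(links.items()):
    print(pair, weight)
```

Under this assumption, each counted pair would become an edge in the displayed graph. Discarding pairs with a low count is one conceivable way to reduce spurious links on larger inputs, though the source does not report doing so.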