Research on Rough Sets Based Chinese Language Modeling and Its Applications

Natural language modeling provides the foundation for processing and applying language information by computer. Although statistical language models have been applied successfully in the field of natural language processing (NLP), improving the efficiency and accuracy of linguistic knowledge mining and of redundant information pruning remains an open problem. Rough set techniques, which are well suited to redundant, contradictory, and vague information, have been applied successfully in Knowledge Discovery in Databases (KDD). By introducing rough set techniques, this paper concentrates on methods and models for mining linguistic knowledge from large-scale corpora and on applying the constructed models in natural language processing. The paper is composed of four parts.

First, to address problems in Chinese pinyin-to-character (PTC) conversion, this paper provides a method for structuring textual information. Based on it, a linguistic knowledge discovery model is constructed for mining Chinese PTC rules from a large-scale corpus. An implementation method for constructing the model is also provided, and the model's performance is evaluated by experiments. Because the rule base is reduced according to the characteristics of the application, the mined rules are application-dependent. Nevertheless, since all rules are mined automatically, the model is still easy to port to other NLP applications.

Second, to handle long-distance constraints efficiently, the combination of rough rules with classical statistical language models is studied. For storage-limited applications, rough rules are first combined with character-based n-gram models, and the performance of the combination is evaluated experimentally. Then, under the maximum entropy (ME) framework, rough rules are combined with word-based trigrams for general applications. The experimental results show a good performance gain for this combination.
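
The rule mining in the first part builds on standard rough set notions, which a small sketch may make concrete. The abstract does not give the paper's actual attributes or algorithm, so everything below is an assumption for illustration: the toy decision table, the attribute names `pinyin`, `prev`, and `char`, and the helper functions are hypothetical. The sketch computes the lower and upper approximations of the set of occurrences written as one character. Objects in the lower approximation would support certain rules, while those in the boundary region are ambiguous under the chosen context attributes.

```python
from collections import defaultdict


def indiscernibility_classes(table, attrs):
    """Group object ids that share the same values on the chosen attributes."""
    classes = defaultdict(set)
    for obj_id, row in table.items():
        key = tuple(row[a] for a in attrs)
        classes[key].add(obj_id)
    return list(classes.values())


def approximations(table, attrs, target):
    """Return (lower, upper) approximations of `target` with respect to `attrs`."""
    lower, upper = set(), set()
    for block in indiscernibility_classes(table, attrs):
        if block <= target:   # block lies entirely inside the target set
            lower |= block
        if block & target:    # block overlaps the target set
            upper |= block
    return lower, upper


# Hypothetical toy decision table (not from the paper): occurrences of the
# pinyin syllable "ta", with the previous word as the only context attribute.
table = {
    1: {"pinyin": "ta", "prev": "jiejie", "char": "她"},
    2: {"pinyin": "ta", "prev": "jiejie", "char": "她"},
    3: {"pinyin": "ta", "prev": "gege", "char": "他"},
    4: {"pinyin": "ta", "prev": "shuo", "char": "他"},
    5: {"pinyin": "ta", "prev": "shuo", "char": "她"},
}

# Target concept: all occurrences whose correct character is "她".
target = {i for i, row in table.items() if row["char"] == "她"}

lower, upper = approximations(table, ["pinyin", "prev"], target)
boundary = upper - lower  # occurrences the context cannot decide

# Dropping the constant attribute "pinyin" leaves the approximations unchanged,
# so it is dispensable here; this is the intuition behind an attribute reduct.
reduced = approximations(table, ["prev"], target)
```

In this toy table, the context `prev = jiejie` yields a certain rule for "她", while `prev = shuo` falls in the boundary region and would need further context or a statistical model to resolve.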

Third, a word sense quantization model is the foundation of word sense disambiguation and sense similarity computation. In this paper, a word space is first constructed from corpus statistics. The quantization model is then constructed by mapping feature words into this word space. To decrease the time complexity of sense similarity computation, an attribute reduct algorithm is introduced to reduce the word space and select the axis words. This part also provides a method for discretizing attribute values. Finally, the model is evaluated automatically by constructing pseudowords, and the results show good performance on the word sense disambiguation task.

Fourth, the popularization of the Internet increases the need for efficient and accurate methods of information acquisition. Because automatic summarization is an important component of information acquisition, the need for high-quality automatic summarization systems has also become more urgent. To meet this need, an adapted Dotplot method based on the word sense quantization model is first provided to address the problem of subtopic segmentation. Then a multi-knowledge-sources-integrating (MKSI) model is constructed, which combines the results of rhetorical structure analysis, text content structure analysis, and subtopic segmentation, and provides clues for extracting abstract sentences. Finally, an automatic evaluation system is provided for evaluating the performance of a text summarization system; based on it, the model parameters are optimized by a genetic algorithm. Our experimental results show that the MKSI model preserves the logic of the original text and generates good-quality abstracts. In addition, the evaluation shows that model performance does not depend strongly on the scale of the training corpus, which is very helpful for automatic summarization, since the training corpus must be constructed manually.
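
The word space idea from the third part can be illustrated with a minimal sketch. The abstract does not state the window size, the weighting scheme, or the similarity measure, so the choices here are assumptions: raw co-occurrence counts within a fixed window, cosine similarity, and a hand-picked list of axis words (in the paper, axis words are selected by an attribute reduct algorithm). The corpus and the function names `context_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def context_vector(word, sentences, axis_words, window=3):
    """Count how often each axis word occurs within `window` tokens of `word`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in axis_words:
                    counts[tokens[j]] += 1
    # One coordinate per axis word, in the order given by `axis_words`.
    return [counts[a] for a in axis_words]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus and axis words, chosen by hand for illustration.
sentences = [
    ["the", "bank", "approved", "the", "loan"],
    ["the", "bank", "raised", "the", "interest", "rate"],
    ["the", "lender", "approved", "the", "loan"],
    ["the", "river", "flooded", "the", "valley"],
]
axis_words = ["loan", "approved", "interest", "flooded", "valley"]

bank = context_vector("bank", sentences, axis_words)
lender = context_vector("lender", sentences, axis_words)
river = context_vector("river", sentences, axis_words)

sim_bank_lender = cosine(bank, lender)
sim_bank_river = cosine(bank, river)
```

Fewer axis words mean shorter vectors and cheaper similarity computations, which is presumably why reducing the word space matters for the time complexity mentioned above.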

© 2004-2009 Latest-Science-Articles.com - All Rights Reserved Worldwide.