A proposal for a Semantic Intelligent Document Repository Architecture


Processing large volumes of documents is a complex challenge, and it becomes harder still when the goal is to extract the semantically relevant data they contain. Large-scale processing of immense knowledge repositories requires information extraction techniques that facilitate the subsequent classification and indexing of texts. Taking this into account, we propose the use of Dublin Core metadata for the classification of Software Engineering publications. Based on the information obtained from Dublin Core, we present a global repository that is populated automatically and takes the form of an ontology representing the distinct areas of Software Engineering knowledge, inspired by SWEBOK (Software Engineering Body of Knowledge). Finally, the classification of texts within the ontology is carried out in three steps: keyword analysis, processing of the document based on a linguistic text classification method, and heuristics, followed by the intersection of the three techniques. We believe that our proposal generates more precise search results in response to user queries.

In Electronics, Robotics and Automotive Mechanics Conference (CERMA 2009)