-----------------------------------------------------------
JoBimText
A framework for distributional semantics
-----------------------------------------------------------
Created: 20.01.2015 by Martin Riedl
-----------------------------------------------------------

1. Introduction
2. Models
   2.1 Download of Models
   2.2 Import of models into a MySql database
   2.3 Configuration for the API
3. Using the API
   3.1 Connecting to the database
   3.2 Functions of the API


1. Introduction

JoBimText [1] is Apache-licensed software for automatic text expansion
using distributional semantics. The framework can be used to compute
distributional similarities between terms. Context features are used to
compute these similarities.


2. Models

This document describes the included API, which can be used to access
pre-computed models. Two models have been computed for German. Both are
based on 70 million German newspaper sentences from the Leipzig Corpora
Collection (http://corpora.uni-leipzig.de/). The first model computes
similarities using the left and right neighbors of the target word as
context features; we call it the trigram model. The second model uses
dependency parses extracted with the Mate Parser
(http://code.google.com/p/mate-tools/).

Both models consist of the following contents:
 - word counts
 - context feature counts
 - word - context feature counts
 - significance scores (LMI) between word and context features
 - sense clusters for the most frequent terms, computed using
   Chinese Whispers [2]


2.1 Download of Models

The models can be downloaded at:
 - http://sourceforge.net/projects/jobimtext/files/data/models/de_70M_trigram/
   (using neighboring words as features)
 - http://sourceforge.net/projects/jobimtext/files/data/models/de_70M_mateparser/
   (dependency parses as features)

To access a model through the API, the model can be loaded into a
database (we show how to do this with MySql).
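
To make the model contents listed in Section 2 more concrete, the
following Java sketch derives trigram-style context features (the left
and right neighbor of each target word) from a tokenized sentence and
computes an LMI score from raw counts. It is only an illustration under
stated assumptions: the class TrigramLmiSketch, its methods and the
feature format left_@_right are hypothetical and not part of the
JoBimText API, and the LMI formula is the commonly used definition
f(w,c) * log2(N * f(w,c) / (f(w) * f(c))), which may differ in detail
from the one used to build the models.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of the JoBimText API.
public class TrigramLmiSketch {

    // For every token that has both a left and a right neighbor, emit one
    // observation "word<TAB>left_@_right". The "_@_" format is an assumption.
    static List<String> contextFeatures(String[] tokens) {
        List<String> observations = new ArrayList<>();
        for (int i = 1; i < tokens.length - 1; i++) {
            String feature = tokens[i - 1] + "_@_" + tokens[i + 1];
            observations.add(tokens[i] + "\t" + feature);
        }
        return observations;
    }

    // LMI under the assumed definition:
    // f(w,c) * log2(N * f(w,c) / (f(w) * f(c))).
    // Returns 0 for pairs that were never observed together.
    static double lmi(long pairCount, long wordCount, long featureCount, long total) {
        if (pairCount == 0) {
            return 0.0;
        }
        double ratio = ((double) total * pairCount) / ((double) wordCount * featureCount);
        return pairCount * (Math.log(ratio) / Math.log(2));
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "cat", "sat", "down"};
        // Print one word/feature observation per line.
        for (String observation : contextFeatures(tokens)) {
            System.out.println(observation);
        }
        // Example counts: f(w,c)=4, f(w)=8, f(c)=8, N=64.
        System.out.println(lmi(4, 8, 8, 64));
    }
}
```

The three kinds of counts passed to lmi correspond to the word counts,
context feature counts and word - context feature counts stored in each
model; the pre-computed models already contain the resulting scores, so
the API does not need to recompute them.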


2.2 Import of models into a MySql database

Here we explain how the data can be loaded into a MySql database. The
API can also be used with other database systems; in that case the
MySql commands in the configuration (described in Section 2.3) should be
modified.

To load a model into the database, it should first be downloaded and
unzipped. Then the database and its tables can be created and the
content can be loaded using one of the scripts:

    create_de_70M_trigram.mysql
or
    create_de_70M_mate.mysql

At the end of the script, replace the string /path/ with the location
where the models have been downloaded.


2.3 Configuration for the API

The API is configured through an XML file. An almost ready-to-use
configuration is provided within the project under the names:
 - conf_mysql_de_70M_trigram.xml (for the trigram model)
 - conf_mysql_de_70M_mate.xml (for the Mate parser model)

When using MySql, none of the SQL commands needs to be changed, but the
server name, user, password and database name (if changed) should be
adjusted (values shown without their enclosing XML elements):

    ....
    jdbc:mysql://SERVERNAME/de_70M_trigram?useUnicode=true&characterEncoding=UTF-8
    USER
    PASSWORD
    com.mysql.jdbc.Driver
    ...

The last line shown (the JDBC driver class) needs to be adjusted when
using a different database system.


3. Using the API

3.1 Connecting to the database

To instantiate the API, the path of the MySql configuration is passed
to the constructor of the class DatabaseThesaurusDatastructure. This
class is an implementation of IThesaurus, which contains all methods
that can be available for a full model. The
DatabaseThesaurusDatastructure class returns all results wrapped in a
datatype of the framework.
If you prefer working with HashMaps, you might consider using the
implementation DatabaseThesaurusMap, which wraps the results into
Strings within a HashMap.

    String config = "conf_mysql_de_70M_trigram.xml";
    DatabaseThesaurusDatastructure dt;
    dt = new DatabaseThesaurusDatastructure(config);
    dt.connect();

When the API is no longer needed, the connection should be closed
again using:

    dt.destroy();


3.2 Functions of the API

The API has several methods to access the data. Here we briefly
describe the most important ones. A more detailed documentation of all
classes can be found in the JavaDoc
(http://maggie.lt.informatik.tu-darmstadt.de/jobimtext/doc/org.jobimtext/).
Additionally, the class MySqlExamples illustrates the usage of the API
and prints results using both models.

 - getSimilarTerms(term): List
     This function returns all terms similar to a given term. The
     result is a list of Order2 instances. Each Order2 object contains
     the similar word and its similarity score.
     [The contextScore is not yet used]
 - getSimilarTerms(term,N:Integer): List
     This function returns the N most similar terms to a given term.
 - getSimilarTerms(term,D:Double): List
     This function returns all similar terms with a similarity score
     above D.
 - getTermCount(term): long
     This function returns the frequency of the term within the
     processed corpus.
 - getContextsCount(context): long
     This function returns the frequency of the context within the
     processed corpus.
 - getTermContextsCount(term, context): long
     This function returns the number of co-occurrences of term and
     context.
 - getTermContextsScore(term, context): long
     This function returns the significance score of the term and
     context.
 - getTermContextsScores(term): List
     This function returns the significant contexts (Order1 objects)
     for a given term. An Order1 object contains the context feature as
     well as the significance score and the frequency of the term and
     the context.
 - getTermContextsScores(term,N:Integer): List
     This function returns the top N contexts for a given term.
 - getTermContextsScores(term,D:double): List
     This function returns contexts with a significance score greater
     than D.
 - getSensesTypes(): String[]
     This function returns the names of the available sense
     computations.
 - getSenses(term): List
     This function returns a list of Sense objects of the standard
     sense type. Each Sense object contains a list of words belonging
     to the sense and a list of IS-A's which label the sense.
 - getSenses(term,sense_type): List
     This function returns a list of Senses for a specified sense type.


[1] Biemann, C., Riedl, M. (2013): Text: Now in 2D! A Framework for
    Lexical Expansion with Contextual Similarity. Journal of Language
    Modelling 1(1):55--95
[2] Gliozzo, A., Biemann, C., Riedl, M., Coppola, B., Glass, M. R.,
    Hatem, M. (2013): JoBimText Visualizer: A Graph-based Approach to
    Contextualizing Distributional Similarity. Proceedings of the 8th
    Workshop on TextGraphs in conjunction with EMNLP 2013