Topicalizer
Topicalizer is a tool that allow you to process a bunch of SOAP API descriptors in order to group the technical information they contain, in semantic related categories, and specifying this categorization as RDF statements stored in a Sesame triple-store. As a first step in the process of categorization, this tool applies text processing procedures over the service descriptors for extracting some relevant technical information (operations, documentation and datatypes). Once such information is available, the tool fits a probabilistic topic model, known as Online LDA, for infering a set of relevant categories (topics–distributions over terms in a fixed vocabulary), and associate a probability distribution over such topics for each one of the service operations processed by the tool.
The Online LDA processing is based on the implementation of ONLINE VARIATIONAL
BAYES FOR LATENT DIRICHLET ALLOCATION
by Matthew D. Hoffman, which uses the online
Variational Bayes (VB) algorithm presented in the paper “Online Learning for Latent
Dirichlet Allocation” by Matthew D. Hoffman, David M. Blei, and Francis Bach.
The algorithm uses stochastic optimization to maximize the variational objective function for the Latent Dirichlet Allocation (LDA) topic model. It only looks at a subset of the total corpus of documents (namely, text files whose content has been extracted from Web APIs documentation archives) each iteration, and thereby is able to find a locally optimal setting of the variational posterior over the topics more quickly than a batch VB algorithm could for large corpora.
Files/Directories provided:
lib/
: Java Dependencies.onlinelda/
: Folder containing the Python scripts which uses online VB for LDA to analyze the information that has been extracted from SOAP API descriptors.outcome/
: This folder holds the .txt and .csv files containing the results of applying Online LDA (topics.<txt/csv>
andper-document-topics.<txt/csv>
).sesame_war/
: Folder containing deployable distribution of theSesame Framework
Server (openrdf-sesame.war
andopenrdf-workbench.war
).run.sh
: Execution script.sample-service-uris.txt
: File containing a list of 80 service descriptors available online (used as a sample input for the tool).WebAPIDocProcessing.jar
Java App for performing parsing and text processing operations on the service descriptors.README.md
: This file.
You will need to have the numpy and scipy packages installed somewhere that Python can find them to use these scripts.
System Requirements:
Java 1.6.x
or greater.MySQL 5.5.x
Apache Tomcat 7.x
(or any available servlet container, listening at 8080 port).
Initial Settings
- In MySQL create a Database with name
service_registry
. - Deploy both of the Sesame Framework .war files on your servlet container.
After you have deployed the Sesame Server webapp, you should be able to access it, by
default, at path
/openrdf-sesame
(/openrdf-sesame/home/overview.view
for Apache Tomcat 7). - Create a new
Native Java Store
with IDWebAPIModel
in the Sesame Server, by accessing http://localhost:8080/openrdf-workbench/ ->New repository
. -
Give execution permissions on the
run.sh
script. Open a terminal and type:$chmod u+x run.sh
Running
-
Open a terminal and type
./run.sh
followed by the path of a text file containing the list of service descriptor URIs. You could use thesample-service-uris.txt
provided with the tool.$./run.sh sample-service-uris.txt
-
After running the whole process, you could verify that the Sesame store you have created has been populated with RDF statements corresponding to the categorization extracted by running the Online LDA algorithm. These RDF statements instantiate the Classes and Properties defined in the RDF Schema model available at
onlinelda/rdf_sesame/web_api_model.rdf
. Additionally, the categorization results are also available as .txt and .csv files at theoutcome/
foder:-
per-document-topics.<txt/csv>
: distribution over topics for each one of the processed operations. -
topics.<txt/csv>
: distributions over terms for each one of the topics extracted.
-