Interacting with Protege project through RMI from external programs

mdpremkumar
Posts: 26
Joined: 17 Apr 2017, 09:06

Hi,

Is it possible to interact with a Protege server project from an external program using RMI?
If anybody has done something similar to this, please share your suggestions and, if possible, some code.

If this can be achieved, such an external program could be used to populate data from other data sources into the EA model. We could use the import utility for this purpose, but it looks very slow when importing large volumes of data. I will start a separate post about the problems we are facing with the import utility.

Thank you.
Prem
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

Hi,

Yes, it is possible - this is how the import utility interacts with a Protege server project.
However, because of the high volume of interactions that take place over RMI, performance will be noticeably affected compared to interacting directly with the Protege API. There is certainly an overhead to interacting via RMI, and this is something the Protege team themselves found when they changed the way the Protege client connects to the Protege server and loads the requested repository.

Jonathan
Essential Project Team
mdpremkumar
Posts: 26
Joined: 17 Apr 2017, 09:06

Hi Jonathan,

Thank you for the quick response.

We are actually trying to use the import utility to sync data from ServiceNow into the EA model.
I have started a separate post about the issues we are facing with the import utility.
Here is the link to that post.
https://www.enterprise-architecture.org ... =21&t=1799

Just as an alternative to the import utility, I asked the question about interacting with the Protege project over RMI.
Even if we use a custom program to interact with the Protege project through RMI, do you think the performance would be very similar to that of the import utility?

If possible, could you have a look at the post I mentioned above and share some thoughts on it?

Also, please let me know if there is a faster way of importing external data into the EA model.
For example, even if a project has been configured for client-server access, it is still possible to have a Protege installation on that server and open the Protege project file locally. Along those lines, if a Python script is available on the file system of that server, is there a way to execute that script against the Protege project automatically through some job, without using RMI and without opening the project in Protege and running the script manually through the Script Console?

Thanks.
Prem
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

Hi Prem,

Thanks for your question.

Although Protege performs well once you’ve got the project loaded, the load-time in server mode - using RMI - can often be long. Network performance is also a key factor in this and it would be worth checking to see whether your Import Utility is running on the same sub-net as your Protege server. If not, that can have a huge impact, presumably down to the sheer volume of RMI calls that are being made.

The strategy that the import utility takes is to behave in a similar way to that of a real user. In this way, all of the constraints, validations and update events within Protege are still applied to ensure that the repository is consistent - in particular for those real users who are also logged on whilst the import is running.

Having said all that, we have noted your performance issues and we are working on that now. It may be that for this scale of import objects, we need to use a rather different approach but we think that it is worth exploring the Python map() approach - with which we’ve had excellent results in other areas outside of imports.

You can certainly run your Protege project in local mode (not server) but still backed by a database and this could improve performance. However, the Import Utility only operates with Protege projects that are file-based (not backed by a database), so running the Import Utility on a non-server project won’t really help unless you run it locally, export and save it and then convert that project to be used by your server. All of that will require a little down-time on the server.

The issue with the script console is that it has a limited (32KB) buffer for running any script. It is worth noting that the Script Console tab was a very large influence on our Python-based import approach.

You could look at building a Protege Job (it does have that ability) that can run the script - rather than running it over RMI - but I think it would have a greater impact to look at the map() approach in the Python first.

In terms of any other faster approaches for getting these volumes into Protege, when running from a file-based repository it is possible to ‘hack’ the .PINS file that contains all of the instances. Assuming you namespaced all your instance IDs, you could re-write your XML into the Protege LISP format and effectively ‘paste’ the 32000 instances straight into the repository instances file. For any elements that you need to refer to - e.g. a relationship to an instance that is already in your repository - you’d have to find the instance IDs for those and use them as part of your ‘import’.
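As a rough illustration only (the class name, slot names and frame-ID scheme below are placeholders - copy the exact syntax from an existing entry in your own .PINS file before relying on this), a small script could generate those entries in bulk:

Code:

# Sketch only: emit CLIPS-style instance entries of the kind found in a Protege
# frames .PINS file. The class name, slot names and ID scheme are illustrative;
# check an existing entry in your own repository for the exact syntax.

def pinsEntry(frameId, clsName, slotValues):
	# slotValues is a list of (slotName, value) pairs; plain strings are quoted,
	# references to other instances are passed already wrapped as "[frame_id]"
	lines = ["([%s] of %s" % (frameId, clsName)]
	for slotName, value in slotValues:
		if not (value.startswith("[") and value.endswith("]")):
			value = '"%s"' % value
		lines.append("\t(%s %s)" % (slotName, value))
	return "\n".join(lines) + ")\n"

records = [
	("servicenow_tp_00001", "Technology_Product",
	 [("name", "Xml"), ("supplier", "[servicenow_sup_00001]")]),
]

outFile = open("generated_instances.pins", "w")
for frameId, clsName, slots in records:
	outFile.write(pinsEntry(frameId, clsName, slots) + "\n")
outFile.close()

As noted, any references to instances that already exist in the repository must use the existing frame IDs for those instances.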

Jonathan
Essential Project Team
mdpremkumar
Posts: 26
Joined: 17 Apr 2017, 09:06

Hi Jonathan,

Thanks again for your suggestions.
Regarding the use of the map() function, are you suggesting using it instead of a traditional for loop to find instances?

If that is the case, I gave it a try with the following sample code.

Code:

import datetime

# kb is the KnowledgeBase object provided by the Protege script console context
productToFind = "Xml"
nameSlot = kb.getSlot("name")

def findProduct(product):
	techProdName = product.getOwnSlotValue(nameSlot)
	if techProdName is not None and techProdName.lower() == productToFind.lower():
		return product

def findProductTraditional(products):
	for product in products:
		techProdName = product.getOwnSlotValue(nameSlot)
		if techProdName is not None and techProdName.lower() == productToFind.lower():
			return product

def findProductUsingListComprehension(products):
	return [product for product in products if product.getOwnSlotValue(nameSlot) is not None and product.getOwnSlotValue(nameSlot).lower() == productToFind.lower()]

products = kb.getCls("Technology_Product").getDirectInstances()

# Approach 1: map() - returns one entry per instance, so non-matches come back as None
a = datetime.datetime.now()
returnedProducts = map(findProduct, products)
print [product for product in returnedProducts if product is not None]
b = datetime.datetime.now()
print "Finding product using map: %s" % (str(b-a))

# Approach 2: traditional for loop - returns as soon as a match is found
a = datetime.datetime.now()
print findProductTraditional(products)
b = datetime.datetime.now()
print "Finding product using traditional for: %s" % (str(b-a))

# Approach 3: list comprehension - evaluates the name slot twice per instance
a = datetime.datetime.now()
print findProductUsingListComprehension(products)
b = datetime.datetime.now()
print "Finding product using list comprehension: %s" % (str(b-a))
The above code uses three approaches (map, a traditional for loop and a list comprehension) to find an exact match for a technology product. When testing this code with 36000 technology product instances, the traditional for loop always performs better than the other two options. I am not sure whether the way I have used map is right. As I have used it, the map function returns a collection with all the non-matching instances as None plus the one matching instance (so again we have 35999 None entries + 1 matching instance), hence I need to filter out everything that is of NoneType. Please suggest the right way of using map if I have done it wrong.
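As a side note, an equivalent, more compact way to do that filtering in a single call (assuming Jython 2.x, where both map() and filter() return lists) would be:

Code:

# filter(None, ...) keeps only the truthy entries, i.e. drops the 35999 None
# results that findProduct returns for the non-matching products
matchingProducts = filter(None, map(findProduct, products))
print matchingProducts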

Some tips on performance improvement in Python suggest making use of list comprehensions, but that also looks slower in this case.

Next, to make use of Protege script jobs, could you please point me to some documentation on how to do it? Does that have to be done in Java code only, or could it be done from any program?

Finally, for your information, the environment where we are doing our testing is as follows.
Our dev server has enough RAM (10 GB) and we have set the memory settings as recommended by https://www.enterprise-architecture.org ... essential/

All our programs, such as the import utility and the C# program which posts XML to the import utility, run on the same dev box. Our Protege project stores its data in a MySQL database. Before testing on the database-backed project, I took a snapshot of it as a file-based Protege project and checked an initial run of 36000 records. It actually broke at the 24250th record in our case due to GC overhead, and the time taken up to that point was more than 30 hours. Then I tried the database-backed project and the speed was a little better; this time all the records were added to the EA model in a duration of 20 hours.

Thanks.
Prem
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

Hi Prem,

I’ve re-worked the script library that is used by the Import Utility to query and update the Protege repository in a different way. I’m seeing significant performance improvements with this new version of the ‘standardFunctions.py’ file.

I realised that there are some issues with using map() to achieve what we were trying to do, so have taken a slightly different approach.

Here is a ZIP file that contains the updated function library along with our out of the box XML transform. You can upload this ZIP into your XML Import Activity and try again. It would be great to know what kind of performance improvement you see. I tested with over 20,000 instances and this ran in about 12 minutes, compared to 4 hours with the previous approach.

Jonathan
essentialXMLImport_v2.zip
Essential Project Team
mdpremkumar
Posts: 26
Joined: 17 Apr 2017, 09:06

Hi Jonathan,

Thank you for providing the latest script library and transform.
After evaluating this, I will be sure to post back the performance improvement results.

Thanks.
Prem
mdpremkumar
Posts: 26
Joined: 17 Apr 2017, 09:06

Hi Jonathan,

We used the latest script library and XSL transform to import an XML file containing 2848 supplier and 36594 software technology product instances. The XML file contains all the supplier instances first, followed by the technology product nodes; each technology product node includes a reference to a supplier instance. Following are our observations from the first run, using the XSL without any modifications.
  • The total time taken was 27 hours (please do not worry - the good news follows below :) )
  • Some of the supplier and product names contained Unicode characters, and the created instances did not have the exact source names
  • The relevant supplier instances were not linked to the technology product instances
When testing again with a smaller subset of records using the old XSL, the above worked fine.

So we thought that the problem could be due to the Unicode characters.
We adjusted the XSL to support Unicode by doing the following (a small illustration of the first two changes follows the list).
  • Included # -*- coding=UTF-8 -*- as the first comment
  • Added the "u" prefix to the relevant string literals passed into method calls
  • Inside the XSL template named InstanceSlot, modified the call to the EssentialGetInstance method as follows, because in our case we do not use the EA instance id, since we import data from ServiceNow:
    anInstance = EssentialGetInstance(aType, "", aName, anExternalRefID, anExternalRepo)
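To illustrate the first two adjustments (the supplier name here is just a made-up example):

Code:

# -*- coding=UTF-8 -*-
# The encoding comment lets the script contain non-ASCII characters, and the u
# prefix makes the literal a Unicode string so lower() and comparisons behave as expected.
supplierName = u"Büro Software GmbH"
print supplierName.lower() == u"büro software gmbh"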
After the above modifications, some changes were made to our custom program as well to form the import XML. There we included instance_type and instance_name attributes for the supplier reference inside the technology_product node. Then we tried the next run.

Now everything worked fine and the import completed in 6 hours.
In both executions, the import utility interacted with the RMI-based server project.
The free space on the C drive of our dev machine is low, and I think that increasing it may gain us some more speed.

Compared to the 30+ hours of execution earlier, this looks much better now.

Thanks.
Prem
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

Thanks very much for the feedback and great to hear that you managed to significantly reduce the runtime.

Well done spotting and resolving the UTF-8 / Unicode issue - it is definitely important if you have any characters that are outside the ASCII character set range.

Jonathan
Essential Project Team
mdpremkumar
Posts: 26
Joined: 17 Apr 2017, 09:06

Hi Jonathan,

Thanks for the reply.
Yes! The new import utility really does have improved performance, and the standard Python library now makes use of more efficient instance-finding methods.

I also have a few other test results and would like to hear your suggestions, just to make sure we are on the right track.

Using a Java program, we referenced the Protege classes and opened the PPRJ file via its absolute path using the edu.stanford.smi.protege.model.Project class. From the Project object, we then obtained the KnowledgeBase.
After getting access to the KnowledgeBase, we accessed the ServiceNow data and created the following (for testing purposes, to check the performance).

36457 Technology_Product instances
2756 Supplier instances
8223 Technology_Node instances.

The overall process completed in 34 minutes.
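For reference, the essence of what the Java program does in local mode, sketched here as Jython against the same Protege 3 API (the file path, class and slot names and the example value are placeholders, and error handling is trimmed):

Code:

from java.util import ArrayList
from edu.stanford.smi.protege.model import Project

errors = ArrayList()
# Open the PPRJ file directly from the file system (local mode, no RMI)
project = Project.loadProjectFromFile("C:/essential/essential_baseline.pprj", errors)
kb = project.getKnowledgeBase()

productCls = kb.getCls("Technology_Product")
nameSlot = kb.getSlot("name")

# Create one instance and set its name slot - the real program loops over the
# ServiceNow records and repeats this for every product, supplier and node
newProduct = kb.createInstance(None, productCls)
newProduct.setOwnSlotValue(nameSlot, "Example Product")

project.save(errors)   # persist the changes back to the project files
project.dispose()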

After this test, we modified the Java program to open the Project over RMI with the help of the class edu.stanford.smi.protege.server.RemoteProjectManager. We reverted the EA project to its original state and checked the timing for creating the above-mentioned instances.

Everything completed in 2.5 hours.
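And the RMI variant, again sketched as Jython (host, credentials and project name are placeholders; this sketch does not call save(), on the understanding that the server persists changes made through a shared project):

Code:

from edu.stanford.smi.protege.server import RemoteProjectManager

# Connect to the Protege server over RMI and fetch the shared project
project = RemoteProjectManager.getInstance().getProject(
	"our-dev-server", "username", "password", "Essential Baseline", True)
kb = project.getKnowledgeBase()

productCls = kb.getCls("Technology_Product")
nameSlot = kb.getSlot("name")

newProduct = kb.createInstance(None, productCls)
newProduct.setOwnSlotValue(nameSlot, "Example Product")

project.dispose()   # close the remote session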

For our requirement, this method looks fine to us.
The ServiceNow data is periodically synced to a SQL Server database.
Hence, from the Java program, we query the required data from SQL Server and populate the EA instances.
We hope that, since we are using the Java classes from protege.jar, this is a proper approach.
But please share your valuable suggestions about this approach.

Thanks.
Prem