Protege - Deleting a large number of instances

melbpar1
Posts: 41
Joined: 19 Sep 2012, 06:18

Hello there,
Does anyone know how to delete a large number of instances in Protege?
I have tried to select and delete all instances (about 6,000). It takes hours and then fails with a Java heap size error. Even deleting 100 records at a time takes about 10 minutes.
Thanks
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

Hi,

I have seen this, but it really depends on which class of instances you're deleting. Classes with many inverse slots (bi-directional relationships) are typically the ones that cause problems.

As a bit of background, this happens because Protege is having trouble processing all the update events. Normally, deleting one instance at a time is just fine, but that is obviously no use for deleting 6000 instances!

I have a script that you can run in the script console for deleting the External Repository Instance References (if you've done a lot of importing, there are often a lot of these), and I've run into problems deleting these en masse.

I'll tweak the script so that you can use it to delete all instances of any specified class and share this on the Share area.

Hope that helps

Jonathan
Essential Project Team
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

I've uploaded a Python script that you can run in the Protege Script Console tab that should help here. I've just tested it with over 3400 Business Processes and all completed in a matter of a few seconds.

Let me know if this helps

Jonathan
Essential Project Team
melbpar1
Posts: 41
Joined: 19 Sep 2012, 06:18

Jonathan,
Thanks for your response and the script.
Unfortunately, a few minutes after I issue the deleteAllInstances command, I get the following error: java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space

I increased the heap size to 1.3G as follows: lax.nl.java.option.java.heap.size.max=1395864371 but still no luck.

Is there a way to have an additional parameter in the script to delete only a number of instances? I would be happy to be able to delete 2,000 at a time instead of none at all.

By the way, I am using Protege v3.4.8

Thanks
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

You could push the heap a bit more on 32-bit Windows, to 1470m (1541406720), but your idea makes more sense.
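
As a quick check of the byte values quoted in this thread, here is a small sketch that converts a heap size in megabytes into the byte count that the lax.nl.java.option.java.heap.size.max setting expects. The helper name mb_to_bytes is just for illustration and is not part of Protege or the shared script.

```python
# Convert a heap size in megabytes to a byte count.
# mb_to_bytes is a hypothetical helper name, used only for this check.
BYTES_PER_MB = 1024 * 1024


def mb_to_bytes(megabytes):
    """Return the whole number of bytes in the given number of megabytes."""
    return int(megabytes * BYTES_PER_MB)


heap_1470m = mb_to_bytes(1470)       # the 1470m value suggested above
heap_1_3g = mb_to_bytes(1.3 * 1024)  # 1.3G expressed in megabytes

print(heap_1470m)
print(heap_1_3g)
```

The two printed values should match the 1541406720 and 1395864371 figures used in the posts here.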

I'll tweak the script to delete in user-defined chunks, so you can experiment with how many you can remove in one go.

Jonathan
Essential Project Team
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

I've published an updated version of the script.

You now have to specify the size of the chunk to use, e.g.
deleteAllInstances("Business_Process", 100)

Let me know how you get on - I've just deleted 1350 instances in a second or so.

Jonathan
Essential Project Team
melbpar1
Posts: 41
Joined: 19 Sep 2012, 06:18

Jonathan
Thanks very much for the update.
I increased the heap size to 1470m (1541406720).

I tried to run the script and realised that it attempts to delete all records regardless of the chunk size.

To get around this, I got rid of the while statement that wrapped the rest of the code (while anInsList.size() > 0:). The while statement was removing, say, 100 instances, checking the size of the list and removing more until the list was empty.

After deleting 100 records a few times, I still get the memory error.

Is there a way to deallocate memory in Python after every run?
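
To illustrate the difference, here is a minimal sketch that uses a plain Python list in place of the Protege instance list. It is not the shared script: the names delete_one_chunk, delete_in_chunks and delete_fn are hypothetical, and delete_fn stands in for whatever actually deletes an instance.

```python
def delete_one_chunk(instances, chunk_size, delete_fn):
    """Delete at most chunk_size items from the front of the list, then stop."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk = instances[:chunk_size]
    for item in chunk:
        delete_fn(item)
    # Drop the deleted items from the pending list.
    del instances[:len(chunk)]
    return len(chunk)


def delete_in_chunks(instances, chunk_size, delete_fn):
    """Repeat chunk deletes until nothing is left (the looping behaviour)."""
    total = 0
    while len(instances) > 0:
        total += delete_one_chunk(instances, chunk_size, delete_fn)
    return total


# Stand-in data: 250 pending "instances" and a list recording deletions.
deleted = []
pending = list(range(250))

first_run = delete_one_chunk(pending, 100, deleted.append)   # removes 100 only
remaining = delete_in_chunks(pending, 100, deleted.append)   # removes the rest
```

With the loop in place, the chunk size only controls how many deletes happen per pass, so everything still ends up deleted. Without the loop, each call removes one chunk and returns.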

Thanks
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

Thanks for the feedback and for trying out just deleting a specific number.
Very interesting that Protege is having trouble even when you delete only 100 instances.

I've previously spent a lot of time looking at memory management in Python (Jython) and there really is nothing that you can do explicitly. The script engine is supposed to 'just take care of that' for you.

What I think is happening is that side effects of the deletes are causing Protege to do a lot more work than one would expect, as it tidies up any related instances after deleting each specific one. My suspicion is that the Default Architecture State is causing the trouble, and possibly any External Repository Instance References if you have done an import. I have seen this sort of thing before (where External Repository Instance References were related to the Default Architecture State), and the script that we've been working on here is derived from the one I had to create to remove such instances.

What I'll do is modify the deleteInstance() function to explicitly remove the instance from the Default Architecture State and to delete any External Repository Instance References it might have before attempting to delete the instance itself. I'll paste it in here for expediency and you can try that with deleting e.g. 100 instances and then building up from there.

Jonathan
Essential Project Team
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

I've created a new version of the script and done some testing with around 4000 instances.

The issue seems to be that the instances we want to delete are mapped to an Architecture State - and if it's the Default Architecture State, that's an instance with links to a very large number of other instances and so Protege is doing a lot of work managing those objects.

The new script attempts to help Protege out by explicitly unlinking any Taxonomies, External References (deleting those as it goes to avoid orphans) and Architecture States before deleting the instance itself.
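
The ordering described above can be sketched with a toy model. This is not the Protege API or the shared script: the repository is a plain dict, and the field names (states, external_refs, elements) and the function unlink_and_delete are hypothetical. Taxonomy links would be handled the same way as the state links.

```python
def unlink_and_delete(store, name):
    """Unlink an instance from its states, delete its external
    references, and only then delete the instance itself."""
    instance = store[name]

    # 1. Remove the instance from every architecture state that lists it.
    for state_name in instance["states"]:
        store[state_name]["elements"].remove(name)
    instance["states"] = []

    # 2. Delete the external references so that no orphans are left behind.
    for ref_name in instance["external_refs"]:
        store.pop(ref_name, None)
    instance["external_refs"] = []

    # 3. Delete the instance, which now has no remaining links to tidy up.
    del store[name]


# Toy repository: one hub state linked to two processes (assumed data).
store = {
    "Default State": {"elements": ["BP1", "BP2"]},
    "BP1": {"states": ["Default State"], "external_refs": ["Ref1"]},
    "BP2": {"states": ["Default State"], "external_refs": []},
    "Ref1": {},
}

unlink_and_delete(store, "BP1")
```

After the call, the toy store holds only the state and BP2, and the state no longer lists BP1.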

Some experiments suggested that the chunk size you choose actually makes little, if any, difference, which is pretty much what you've found.
I managed to get Protege to do this in around 9 minutes, using up to 1.43GB of memory for my 4000 instances.

This seemed a bit borderline, so I've tweaked the script to temporarily stop update events until the deleting is complete. Protege can handle this, and it saves Protege chasing every delete event. This cut my runtime to around 4 minutes and used under 1GB of memory.
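
The general pattern of pausing events around a bulk delete can be sketched as below. This is an illustration only, not the shared script: FakeKnowledgeBase, set_events_enabled and with_events_suspended are hypothetical stand-ins, and the real Protege call used to pause events is not shown in this thread. The try/finally makes sure events are switched back on even if a delete fails part-way.

```python
def with_events_suspended(set_events_enabled, work):
    """Run work() with events switched off, then switch them back on."""
    set_events_enabled(False)
    try:
        return work()
    finally:
        # Always re-enable events, even if work() raised an error.
        set_events_enabled(True)


class FakeKnowledgeBase(object):
    """Stand-in for a knowledge base that counts the events it fires."""

    def __init__(self, names):
        self.instances = list(names)
        self.events_enabled = True
        self.events_fired = 0

    def set_events_enabled(self, enabled):
        self.events_enabled = enabled

    def delete_instance(self, name):
        self.instances.remove(name)
        if self.events_enabled:
            self.events_fired += 1


kb = FakeKnowledgeBase(["a", "b", "c"])


def delete_everything():
    # Iterate over a copy because delete_instance mutates the list.
    for name in list(kb.instances):
        kb.delete_instance(name)


with_events_suspended(kb.set_events_enabled, delete_everything)
```

In this toy run, all three instances are deleted without a single event being fired, and events are enabled again afterwards.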

Thanks for your patience with this but, hopefully, this latest script (updated in the Share area) will get your instances deleted.

Jonathan
Essential Project Team
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

Here is an alternative approach to removing these instances.

In Protege:
1. Go into Strategy Management and then Architecture State in EA_Support
2. Select the Architecture State called 'Default EA Architecture State' and delete it.
3. Now run the deleteAllInstances() script.

It's still worth using step 3 as this will ensure that any external references for your 6000 instances are also deleted in a tidy fashion.

I've just tested this with my 4000 instances and it completed in a matter of seconds (as opposed to 4 minutes!) :D

Keep me posted with progress - and I'm going to update that script to provide 2 alternative functions:

1. Delete all instances of specified class
2. Delete N instances of specified class (and don't then do the next N!)

Jonathan
Essential Project Team
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

Updated script file now available in the Share area.

Jonathan
Essential Project Team
melbpar1
Posts: 41
Joined: 19 Sep 2012, 06:18

Thanks Jonathan
The alternative method worked for me. It took only a few seconds to delete all records once I deleted the Default EA Architecture State and ran your latest script.

Having said that, I am not sure what effect deleting the Default EA Architecture State has on the rest of my ontology.
Should I re-create it and link any previously associated instances?

Thanks
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

No, leave the Default Architecture State out. There will be no side effects.

On reflection, this architecture state does not really do anything for you, other than cause these sorts of problems, and it has been removed from the Essential Meta Model version 4.

Jonathan
Essential Project Team