Issue with data import for large data sets

ToniVerbeiren
Posts: 5
Joined: 13 Aug 2010, 05:55

Hi,

I'm looking into using the integration tab for importing relatively large sets of data (2000 technical nodes, including deployment info on OSs, etc.).

I successfully created a Python script, like the one included in the integration examples. The problem is that once the file to be executed (via execfile()) is above a certain size (approximately 182 KB), the interpreter throws an error instead of parsing it.

I suspect that this is the reason for having 'essentialImport_all.txt' and 'essentialImport_0.txt' in the integration examples?

Is this correct? Is there a way to avoid the splitting of the file? I generate the scripts from within Excel using a simple vbs script. I would like to avoid having to set up the splitting in vbs...

Thanks!
Toni
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

Hi Toni,

Yes, there is a limit of around 64 KB in the Script Console tab which, as I understand it, is actually a constraint of the underlying Bean Scripting Framework that the Script Console uses.

You are correct: the 'essentialImport_0.txt' files etc. in the examples exist precisely because of this limit. The Essential Integration Server and its replacement, the Essential Integration Tab, manage this for you and 'chunk' the script into 32 KB pieces (just to be safe) that are run sequentially in the same context.
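As a rough illustration of the idea (a sketch only, not the actual Integration Engine code), chunking on line boundaries so that no statement is cut in half might look like this; it assumes each top-level statement in the generated script sits on a single line, as the generated import scripts typically do:

    # Sketch only - not the Integration Engine implementation.
    # Splits a generated import script into pieces of at most ~32 KB, breaking
    # only on line boundaries. Assumes every top-level statement fits on one line.
    def chunkScript(path, maxBytes=32 * 1024):
        chunks, current, size = [], [], 0
        for line in open(path):
            if size + len(line) > maxBytes and current:
                chunks.append("".join(current))
                current, size = [], 0
            current.append(line)
            size += len(line)
        if current:
            chunks.append("".join(current))
        return chunks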

The 'essentialImport_all.txt' from the Essential Integration Server is a helper file that runs each of the chunks.

Coincidentally, I've just been working on version 1.1 of the Essential Integration Engine (used by the Essential Integration Tab) that enables it to run pre-built integration scripts, such as the one you have created, chunking them and running each chunk sequentially. I need to update the Tab's user interface to support this new feature. The idea is that you would just supply your ready-to-run integration script and bypass all the XML transformation that the Integration Server and Tab expect to run.

I realise that this doesn't help you right now, but I expect to have that update to the Integration Tab done shortly. Until then, you might want to look at a simple approach in your VBS that breaks the output Python after every 100 or so technical nodes. If you run each script one after the other in the Script Console Tab, the script context persists across each call you make to execfile(), which means that you can reference variables that you defined in earlier chunks.
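For example, running the chunks one after another might look like the following (the file names are placeholders for whatever your VBS macro writes out):

    # Each execfile() call runs in the same interpreter context, so names defined
    # in an earlier chunk remain visible to the later ones. File names are
    # placeholders for whatever your VBS macro produces.
    execfile("essentialImport_0.txt")   # e.g. shared variables and the first ~100 nodes
    execfile("essentialImport_1.txt")   # can reference anything defined in chunk 0
    execfile("essentialImport_2.txt")   # and so on for the remaining chunks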

Let me know how you get on with this, or if you have any further questions about the integration import.

Hope this helps, and look out for version 1.1 of the Integration Tab, which will take your integration script and manage the chunking for you.

Jonathan
Essential Project Team
ToniVerbeiren
Posts: 5
Joined: 13 Aug 2010, 05:55

Hi Jonathan,

Thanks for the quick answer and confirmation. I'm looking forward to the update of the integration module.

I modified the Excel macro to split the output into different files, just as the Integration Tab does. It gets the job done... but it brings me to my next question:

I'm importing data from different sources. After some experimentation, I found that creating 200 instances, including some references and slot attributes, can easily take a minute to process. When annotation is active, it takes more than an hour! All this time, the server is in a frozen state, although it is not really doing much. I guess this is because it has to handle all the change information.

Waiting one minute is still feasible, but the latest import file contains 4000 or so instances with attributes. Unfortunately, even without change tracking, it takes ages to process, even on a local (in-memory) knowledge base!

Do you have any ideas on how to improve performance? Are there ways to speed up the integration?

Thanks!
Toni
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

Hi Toni,

You're correct that if you have change tracking enabled (as part of Collaborative Protege) then the server has a lot of extra work to do. When I know I have a lot of items to import, I usually turn off the change history capability before running the import and re-enable it when the import is complete. This should significantly improve performance and will also greatly reduce the memory requirements.
Until the most recent versions of Protege, when the annotations were managed via a file-based repository, this was a real problem (and my server often ran out of memory). However, if you have not already done so, using a database backend for the annotations repository should again greatly improve performance.

If you are still having problems with performance, it's then worth reviewing your import script code to ensure that there are no serious inefficiencies in there (e.g. nested loops). Just the other day I imported details of over 700 physical servers in a matter of seconds. Each of these servers required additional instances to be created for attributes, relationships to other instances and so on, which means that in practice this import created well over 1000 instances in the repository.
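As an illustration of the kind of inefficiency to look for (a sketch only - the class and slot names, the 'rows' source data and the kb variable are assumptions based on the integration examples, not necessarily what your script uses), the usual culprit is looking frames up, or scanning existing instances, inside the main loop rather than once up front:

    # Sketch only: class/slot names, 'rows' and 'kb' are illustrative assumptions.
    # Resolve the frames once and build a lookup table up front, instead of
    # calling getCls()/getSlot() and scanning all instances for every row.
    nodeCls = kb.getCls("Technology_Node")
    nameSlot = kb.getSlot("name")

    existing = {}
    for inst in nodeCls.getInstances():
        existing[inst.getOwnSlotValue(nameSlot)] = inst

    for row in rows:                                     # rows = your source data
        if row["name"] not in existing:
            newInst = nodeCls.createDirectInstance(None)   # Protege generates the frame name
            newInst.setOwnSlotValue(nameSlot, row["name"])
            existing[row["name"]] = newInst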

Are you using the Protege API directly, or have you been able to use the set of helper Python functions that are available in 'standardFunctions.txt'?
I'm going to publish an initial version of the how-to guide for using these functions and creating integrations in the next few days.

Jonathan
Essential Project Team
ronancr
Posts: 1
Joined: 05 Nov 2010, 18:16

Hi,

We have performance problems when importing. We are using a Power5+ with a 2.1 GHz processor running Linux, in multi-user mode, with PostgreSQL 8.1. It takes more than 10 minutes to import about 1000 items, and I see that the Java process goes to 100% CPU utilization. Having more processors doesn't help, because it is single-threaded.

I am using the Python script available in "Data Load v2" (generateLoadCommand.py), and then I use the generated function createApplicationProvider to create a single instance. To import 1000, I must create them one by one. But by doing this, I guess the application must acquire some lock for each insert. Am I right? That would explain the long time it takes to import. Is it possible to turn off multi-user mode in order to avoid this lock?
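To illustrate, it is essentially a loop like the one below (the argument to the generated createApplicationProvider is simplified here, and the list of names is just a stand-in for my source data):

    # Sketch only: 'applicationNames' stands in for my source data, and the
    # single-argument call to createApplicationProvider is a simplification.
    for appName in applicationNames:
        createApplicationProvider(appName)   # one call, and presumably one lock, per instance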

Another issue: when I browse through classes that have a large number of instances, it takes a long time to build the Instance Browser list. I guess this is due to the fact that the application gathers all the information needed to present in the Instance Editor. Is it possible in some way to build these lists faster, perhaps by not loading the instance details? I know that if I go to the Queries Tab I can obtain this, but I would like a way to do it without having to build a query.

Thanks,

Ronan
jonathan.carter
Posts: 1087
Joined: 04 Feb 2009, 15:44

Hi Ronan,

Depending on the nature of the import, a lot of processing can be required, which can take some time using the Script Console.

From what you're describing, it sounds like you are working directly in the Script Console tab. I have seen some interesting behaviour with the Script Console tab when running scripts that run sub-scripts. It seems that running each script separately uses less memory and performs faster.

The Essential Integration Tab uses a different set of import functions that may be worth considering - it chunks large imports and runs them sequentially.

In terms of locks on the database etc., Protege is really very efficient at managing how the instances are persisted to the database but performance is bound to be affected if you are executing an import while others are using the repository.

There is no configuration control to turn off any locking, but you can take the repository out of multi-user mode temporarily:
  • First, stop the server.
  • Once it has shut down, start the Protege client on the server host and open your repository in stand-alone mode.
  • Perform the import, save your updated repository and close the Protege client.
  • Finally, restart the Protege server, which will now contain your imported instances.
From what you describe in your second issue about the loading of instances in the Instance Browser, the most recent versions of Protege (3.4.x) have added a multi-threaded approach to loading these. If you can see instances being added and can select them while the list is still loading, then you're probably already using a later version. If, however, the Instance Browser is blank until all the instances have loaded, you are using an older version of Protege and should upgrade to the latest supported version, 3.4.4.

Hope this helps

Jonathan
Essential Project Team
Kevin Campbell
Posts: 40
Joined: 13 Sep 2010, 20:26

A follow-up to Jonathan's most recent note:

The first time you open an instance list after opening the project, it will take some time to populate, but from that point onwards you'll find it's pretty snappy. I just leave the Protege client open all the time (for weeks on end, without any problem) to avoid the need to repopulate the lists.

We've imported thousands of instances using the script console and it's a one-time thing; a bit time-consuming initially, but once the bulk of the model is populated it's really not such an issue.

Kevin