Advanced Importing of Application Initial Data

Amir Vosough
4 min read · Mar 29, 2020


While working on my pet project, LoCoRepo, I faced a challenge that I think is interesting to share. The application imports data from a user’s Git repository: it parses a bunch of files in that repository and persists the parsed data in the database. That is the main functionality of the application, but some of these repositories are considered “initial data”, and this importing needs to happen before anything else.

First thought

Usually we handle these scenarios with database migration tools such as Liquibase or Flyway, but this case is more advanced because business logic is involved: the processing of the Git repository. Besides, my database is a NoSQL database, and recently I’ve been thinking about splitting my data into two types, graph-based and non-graph data. Those migration tools don’t support NoSQL databases well, and my initial research didn’t turn up anything interesting.

Second thought

I could use the @PostConstruct annotation on a method of a Spring bean to do this on application start-up, which is what I did at first. When the application starts, it processes the initial Git repositories and fills the database if it hasn’t been filled before. The problem with this solution is that it’s way too slow! For example, I clone a giant Git repository and then switch between its tags to extract the data. To make matters worse, this happens for every instance of my application, so when Kubernetes spins up a new instance, it has to go through this again (at least the Git part), even though I’m almost sure the data is already there.
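A minimal sketch of that first approach could look like this (InitialDataLoader, ProjectRepository and GitExtractionService are made-up names for illustration, not the actual LoCoRepo classes):

import javax.annotation.PostConstruct;
import org.springframework.stereotype.Component;

@Component
public class InitialDataLoader {

    private final ProjectRepository repository;   // hypothetical domain repository
    private final GitExtractionService extractor; // hypothetical service that clones and parses repos

    public InitialDataLoader(ProjectRepository repository, GitExtractionService extractor) {
        this.repository = repository;
        this.extractor = extractor;
    }

    @PostConstruct
    public void importInitialData() {
        // Skip the expensive work if the database has already been filled
        if (repository.count() > 0) {
            return;
        }
        // Clone the initial repositories, walk their tags and persist the parsed entities
        extractor.extractInitialRepositories().forEach(repository::save);
    }
}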

Final thought

If I want to solve the problems of the previous solution, I should extract this part from my application. So let’s break it into two parts: an Extractor and an Importer.

The “extractor” application will process the Git repositories and export the data into something that the “importer” application will then use to import the processed data into the production database. When I reached this point, something else clicked in my head that I had postponed so far: the Git repositories change, and so should my database. Now that I have these standalone applications, I should be able to run them as scheduled tasks, and Kubernetes jobs look like a good candidate for this (I don’t know much about them yet, but a skim of the documentation looks promising).

Now let’s get into the details of the gray areas. First stop: I mentioned the extractor will export data into “something”. That something should be kept in sync with the production database, so what format is appropriate here? We should probably take an approach similar to the database migration tools mentioned at the start, so the “something” is a list of changes to the data over time. As I mentioned, the main business of the extractor application is the same as the main domain functionality, so all we need to do is override our repositories so that they work by calculating the history of our data and exporting it into files. For this purpose, I created a new module which depends on my main domain module, and using the @Primary annotation, I override the repository beans with my custom implementation. The custom implementation keeps an in-memory linked list of entities as the current state of the “table” and persists the changes in file(s). I could probably also use a ReentrantReadWriteLock to ensure thread-safety.
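A rough sketch of such an overriding repository, assuming a hypothetical Project entity, a ProjectRepository domain interface, a ChangeLogWriter that appends entries to file(s), and a ChangeLogEntry model sketched a bit further down:

import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import org.springframework.context.annotation.Primary;
import org.springframework.stereotype.Repository;

@Primary
@Repository
public class FileBackedProjectRepository implements ProjectRepository {

    // Current state of the "table", kept only in memory
    private final LinkedList<Project> entities = new LinkedList<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final ChangeLogWriter changeLog; // hypothetical component persisting entries to file(s)

    public FileBackedProjectRepository(ChangeLogWriter changeLog) {
        this.changeLog = changeLog;
    }

    @Override
    public Project save(Project entity) {
        lock.writeLock().lock();
        try {
            // Replace the entity in place if it already exists, otherwise append it
            ListIterator<Project> it = entities.listIterator();
            while (it.hasNext()) {
                if (it.next().getId().equals(entity.getId())) {
                    it.set(entity);
                    changeLog.append(ChangeLogEntry.saved(entity)); // immutable "update log"
                    return entity;
                }
            }
            entities.add(entity);
            changeLog.append(ChangeLogEntry.saved(entity)); // immutable "insert log"
            return entity;
        } finally {
            lock.writeLock().unlock();
        }
    }

    @Override
    public List<Project> findAll() {
        lock.readLock().lock();
        try {
            return List.copyOf(entities);
        } finally {
            lock.readLock().unlock();
        }
    }
}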

For example, during an update of an entity, I just find that entity in the linked list and replace its value with the new one, and I also add an “update log” to the list of immutable changes persisted in file(s). As you’ve guessed, the removal of an entity removes it from the linked list and adds a “remove log” to the list of immutable changes.
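Continuing the same sketch, the ChangeLogEntry used above could be a small immutable model, and removal drops the entity from the list while recording a “remove log”:

public final class ChangeLogEntry {

    public enum Operation { SAVE, REMOVE }

    private final Operation operation;
    private final String entityId;
    private final Project payload; // null for removals

    private ChangeLogEntry(Operation operation, String entityId, Project payload) {
        this.operation = operation;
        this.entityId = entityId;
        this.payload = payload;
    }

    public static ChangeLogEntry saved(Project entity) {
        return new ChangeLogEntry(Operation.SAVE, entity.getId(), entity);
    }

    public static ChangeLogEntry removed(String entityId) {
        return new ChangeLogEntry(Operation.REMOVE, entityId, null);
    }

    public Operation getOperation() { return operation; }
    public String getEntityId() { return entityId; }
    public Project getPayload() { return payload; }
}

// ... and inside FileBackedProjectRepository from the sketch above:
@Override
public void deleteById(String id) {
    lock.writeLock().lock();
    try {
        entities.removeIf(e -> e.getId().equals(id));
        changeLog.append(ChangeLogEntry.removed(id)); // immutable "remove log"
    } finally {
        lock.writeLock().unlock();
    }
}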

How should we keep updating this list of changes? I think if I just keep the change-log files in a volume, then the next time the job is triggered I can initialize the current state by replaying the change logs first, and then continue with the normal application process. This way, only the new change logs are added.
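To rebuild the state from an existing volume, the custom repository could replay those entries before doing anything else. A sketch, assuming a hypothetical ChangeLogReader injected next to the writer:

// Also inside FileBackedProjectRepository: rebuild the in-memory state from the
// change logs already on the volume, without appending any new entries
@PostConstruct
public void replayExistingChangeLogs() {
    lock.writeLock().lock();
    try {
        for (ChangeLogEntry entry : changeLogReader.readAll()) {
            entities.removeIf(e -> e.getId().equals(entry.getEntityId()));
            if (entry.getOperation() == ChangeLogEntry.Operation.SAVE) {
                entities.add(entry.getPayload());
            }
        }
    } finally {
        lock.writeLock().unlock();
    }
}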

The importer’s job should be easy now: go through these change logs and perform the appropriate action for each entry. I will update this post if I run into anything worth mentioning while implementing that part.
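With that format in place, the importer could boil down to something like the following, where productionRepository stands in for whatever repository writes to the real database:

public void importChanges(List<ChangeLogEntry> entries) {
    // Replay each recorded change against the production database, in order
    for (ChangeLogEntry entry : entries) {
        if (entry.getOperation() == ChangeLogEntry.Operation.SAVE) {
            productionRepository.save(entry.getPayload());
        } else {
            productionRepository.deleteById(entry.getEntityId());
        }
    }
}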

Update: As always, there were challenges I hadn’t seen when I started working on this, but not enough to completely change the solution.

One challenge was saving/loading my entities to/from files. I used Jackson to export my entities into YAML files. Sometimes I had to customize Jackson serialization/deserialization, but thanks to mix-ins, I avoided touching my entity classes.
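For example, a Jackson mix-in lets you attach annotations without modifying the entity itself. A sketch, where Project and its getInternalCache() accessor are made-up placeholders:

import com.fasterxml.jackson.annotation.JsonIgnore;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory;

public class YamlExportExample {

    // The Jackson annotations live on this mix-in instead of on the entity class itself
    abstract static class ProjectMixIn {
        @JsonIgnore
        abstract Object getInternalCache(); // pretend the entity has something we don't want exported
    }

    public static String toYaml(Project project) throws Exception {
        ObjectMapper yamlMapper = new ObjectMapper(new YAMLFactory());
        yamlMapper.addMixIn(Project.class, ProjectMixIn.class);
        return yamlMapper.writeValueAsString(project);
    }

    public static Project fromYaml(String yaml) throws Exception {
        ObjectMapper yamlMapper = new ObjectMapper(new YAMLFactory());
        yamlMapper.addMixIn(Project.class, ProjectMixIn.class);
        return yamlMapper.readValue(yaml, Project.class);
    }
}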

The next challenge, which I am working on now, is that at first I was exporting one file per entity. If I do that, how do I know which file to pick first? My database is a NoSQL database, so there are no foreign keys, but still, the beauty of this solution was that it would “record” a business action and “play it back” later. So I think I will change the solution to write each export execution into a new file; on the next execution, if there is a data change, it will simply create another file. This way I also keep the order in which each record was inserted.
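One simple way to get one file per execution while keeping the ordering is to name the files by a timestamp (or a sequence number) and sort them when reading. A sketch under those assumptions:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ChangeLogFiles {

    private final Path directory; // e.g. a mounted volume such as /data/changelogs

    public ChangeLogFiles(Path directory) {
        this.directory = directory;
    }

    // A new file per extractor execution, named so that lexicographic order matches creation order
    public Path newFileForThisExecution() {
        return directory.resolve("changelog-" + Instant.now().toEpochMilli() + ".yaml");
    }

    // The importer (and the replay step) reads all files in the order they were produced
    public List<Path> allFilesInOrder() throws IOException {
        try (Stream<Path> files = Files.list(directory)) {
            return files.sorted().collect(Collectors.toList());
        }
    }
}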
