Friday, 12 September 2008

BootStrapping Grails With Spring Batch

Normal Bootstrapping

As part of the current app I'm putting together, I needed to populate my database with data from some flat files. The initial approach was to create a BootStrap class for each of the domain classes. Each BootStrap class reads the input file, splits each line, maps each item from the line to an attribute of the domain object and then saves the object.
class DomainClassBootStrap {
  def init = { servletContext ->
    def domainFile = new File('./grails-app/conf/resources/DomainFlatFile.txt')
    domainFile.splitEachLine(',', handleLine)
  }

  def handleLine = { lineItems ->
    def domainClass = new MyDomainClass()
    domainClass.firstName = lineItems[0]
    domainClass.lastName = lineItems[1]
    domainClass.height = lineItems[2] ? lineItems[2].toDouble() : null
    //map all fields
    domainClass.save()
  }

  def destroy = { }
}
Each BootStrap class was tested with an integration test:
import org.springframework.mock.web.MockServletContext

public class DomainClassBootStrapTests extends GroovyTestCase {

  def bootStrapper

  void setUp(){
    bootStrapper = new DomainClassBootStrap()
  }

  void tearDown(){
    MyDomainClass.list()*.delete()
  }

  void testInit(){
    bootStrapper.init(new MockServletContext())
    assert 520378 == MyDomainClass.count()
  }
}
This approach worked for all the tables apart from one file, which contained around five hundred thousand records. The BootStrap class that was inserting this large number of records failed to complete after running for fifteen minutes. The problem appears to be that the BootStrap was running in a single transaction and the default test HSQLDB was slowing to a crawl. Switching from an in-memory database to a file-based one by editing grails-app/conf/DataSource.groovy, and running the test while tailing the testDB.log file, showed that the insert statements were slowing down and eventually stopping.
test {
  dataSource {
    dbCreate = "create-drop"
    url = "jdbc:hsqldb:file:db/testDB;shutdown=true"
    //url = "jdbc:hsqldb:mem:testDB;shutdown=true"
  }
}
Switching the tables from memory to cached and upping hsqldb.cache_scale from 14 to 18 did not solve the problem. To solve this issue, the BootStrap data needs to be committed after processing a certain number of records, and the file needs to be split into chunks and processed asynchronously.
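The idea of committing every N records can be sketched in plain Groovy before bringing in any framework. This is only an illustration under assumptions: processInChunks is a hypothetical helper, the sample lines are made up, and commitChunk stands in for whatever actually ends the transaction (in Grails that might be the end of a withTransaction block).

```groovy
// Hypothetical helper: handle the lines in fixed-size chunks and call
// commitChunk once per chunk rather than once for the whole file.
def processInChunks(List<String> lines, int chunkSize, Closure handleLine, Closure commitChunk) {
  int commits = 0
  for (int start = 0; start < lines.size(); start += chunkSize) {
    int end = Math.min(start + chunkSize, lines.size())
    lines.subList(start, end).each { line ->
      handleLine(Arrays.asList(line.split(',')))   // pass the fields as a List
    }
    commitChunk()   // stand-in for the real commit
    commits++
  }
  return commits
}

def saved = []
def commits = processInChunks(
    ['Ann,Lee,1.6', 'Bob,Ray,1.8', 'Cy,Poe,1.7'], 2,
    { items -> saved << items[0] },
    { /* a real implementation would commit here */ })
assert saved == ['Ann', 'Bob', 'Cy']
assert commits == 2   // three lines in chunks of two
```

Spring Batch provides this behaviour, plus restart and status tracking, without hand-written loops, which is why the rest of the post uses it.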

Spring Batch Grails BootStrapping

Spring Batch was written to handle bulk processing of data and so fits the bootstrapping of large data sets in Grails. In Spring Batch terms, a process you need to run is called a job, and a job can be broken down into steps. Each job is registered in a JobRepository and can be run using a JobLauncher. To use Spring Batch in Grails, download the Spring Batch distribution (the with-dependencies download includes everything you need) and add the core and infrastructure jars to your Grails lib directory.

The JobRepository and JobLauncher

To define the JobRepository in Grails, add the following to the resources.groovy file:
jobRepository(org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean){
  transactionManager = ref("transactionManager")
}
This creates an in-memory job repository that stores the state of batch job instances, the parameters that were passed to start the job, the context each job is running in and the status of each step in the job. You can use a different configuration to persist this information, but the MapJobRepositoryFactoryBean reduces the amount of configuration needed. Notice the reference to the transactionManager bean that Grails sets up for you. Now that there is somewhere to store the jobs and record their status, a way of launching them is required. To launch jobs, Spring Batch defines a SimpleJobLauncher that implements the JobLauncher interface. Add an instance of SimpleJobLauncher to your resources.groovy file. Setting the taskExecutor to an instance of SimpleAsyncTaskExecutor allows the launcher to run jobs asynchronously. The job launcher also requires a reference to your job repository bean:
jobLauncher(org.springframework.batch.core.launch.support.SimpleJobLauncher){
  jobRepository = ref("jobRepository")
  taskExecutor = { org.springframework.core.task.SimpleAsyncTaskExecutor executor -> }
}

ItemReader, LineTokenizer, FieldSetMapper and ItemWriter

Now that there is a way of launching and managing jobs, we can go ahead and create our bulk import job. In Spring Batch each job is made up of steps. In this example the job has a single step that does three things:
  1. Read the file and parse each line.
  2. For each line in the file map each line item to a domain object field.
  3. Save the domain object to the database.
Spring Batch makes reading flat files easy by providing a class called FlatFileItemReader. A FlatFileItemReader requires you to provide a LineTokenizer, a FieldSetMapper and a Resource. The LineTokenizer is given each line of the file and divides it into separate fields. The framework includes a DelimitedLineTokenizer with a delimiter property that sets the character on which to split each line. The FieldSetMapper is a class you need to write that implements the FieldSetMapper interface. The interface defines one method, mapLine, that takes a FieldSet parameter. The FieldSet provides access to the fields by index. The Resource is the file that contains the data and can be configured using the Spring FileSystemResource class. With these components in place we can write the mapper and the full itemReader bean definition.
import org.springframework.batch.item.file.mapping.FieldSetMapper
import org.springframework.batch.item.file.mapping.FieldSet

public class MyDomainClassMapper implements FieldSetMapper {

  def mapLine(FieldSet fs) {
    if(!fs) {
      return null
    }
    def domainClass = new MyDomainClass()
    domainClass.firstName = fs.readString(0)
    domainClass.lastName = fs.readString(1)
    domainClass.height = fs.readString(2) ? fs.readDouble(2) : null
    return domainClass
  }
}
The reader, its data file and the tokenizer are then defined as beans:
myDomainItemReader(org.springframework.batch.item.file.FlatFileItemReader){
  fieldSetMapper = { MyDomainClassMapper mapper -> }
  lineTokenizer = ref("myDomainLineTokenizer")
  resource = ref("myDomainClassDataFile")
}

myDomainClassDataFile(org.springframework.core.io.FileSystemResource, './grails-app/conf/resources/DOMAIN_DATA.txt')

myDomainLineTokenizer(org.springframework.batch.item.file.transform.DelimitedLineTokenizer){
  delimiter = "^"
}
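To make the division of labour concrete, here is a plain-Groovy illustration of what the tokenizer and the mapper do to a single line. It deliberately does not use the Spring Batch classes: the tokenize and mapFields closures and the sample lines are made up for this sketch, and the result is a map rather than a MyDomainClass instance.

```groovy
import java.util.regex.Pattern

// Tokenizer stand-in: split one line on the "^" delimiter.
// The -1 limit keeps trailing empty fields instead of dropping them.
def tokenize = { String line -> Arrays.asList(line.split(Pattern.quote('^'), -1)) }

// Mapper stand-in: map fields to properties by index, as MyDomainClassMapper does.
def mapFields = { List<String> fields ->
  [firstName: fields[0],
   lastName : fields[1],
   height   : fields[2] ? fields[2].toDouble() : null]
}

def row = mapFields(tokenize('Jane^Doe^1.75'))   // sample data, not from the real file
assert row.firstName == 'Jane'
assert row.lastName == 'Doe'
assert row.height == 1.75d

// An empty height field maps to null rather than failing the conversion.
assert mapFields(tokenize('John^Smith^')).height == null
```

Keeping the splitting and the mapping separate is what lets the same mapper be reused if the delimiter of the data file changes.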
Saving the domain class is handled by an ItemWriter that extends AbstractItemWriter and simply calls the save method, printing the object and its validation errors if the save fails.
import org.springframework.batch.item.support.AbstractItemWriter

class MyDomainItemWriter extends AbstractItemWriter {

  void write(Object item){
    if(!item.save()){
      println item.dump()
      item.errors.allErrors.each{
        println it
      }
    }
  }
}

SimpleStepFactoryBean

To put the item reader and writer into a step you use a SimpleStepFactoryBean and set each as the respective property. To make this step execute asynchronously, the taskExecutor must be set to a SimpleAsyncTaskExecutor. Setting the commitInterval to 10 commits the transaction after every 10 items, which keeps the transaction log from growing too large. The SimpleStepFactoryBean also requires a reference to the transaction manager and the job repository:
myDomainItemWriter(MyDomainItemWriter)

myDomainDataStep(org.springframework.batch.core.step.item.SimpleStepFactoryBean){
  transactionManager = ref("transactionManager")
  jobRepository = ref("jobRepository")
  itemReader = ref("myDomainItemReader")
  itemWriter = ref("myDomainItemWriter")
  commitInterval = 10
  taskExecutor = { org.springframework.core.task.SimpleAsyncTaskExecutor executor -> }
}
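The commit interval is a trade-off between the size of each transaction and the number of commits. As a rough back-of-the-envelope sketch (commitsFor is a hypothetical helper, and the record count is the one asserted in the integration test earlier), the number of commits for a given interval can be worked out like this:

```groovy
// Hypothetical helper: number of commits needed for a record count and a
// commit interval, rounding up so a final partial chunk is still committed.
int commitsFor(int records, int commitInterval) {
  return (records + commitInterval - 1).intdiv(commitInterval)
}

assert commitsFor(520378, 10) == 52038    // the interval used in this post
assert commitsFor(520378, 1000) == 521    // larger chunks mean far fewer commits
```

Which interval works best depends on the database and the size of each record, so it is worth measuring rather than assuming.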

SimpleJob

The step must be included in a job in order to execute, and this is set up using the SimpleJob class:
myDomainDataJob(org.springframework.batch.core.job.SimpleJob){
  steps = [ ref("myDomainDataStep") ]
  jobRepository = ref("jobRepository")
}

Launching the Job From a BootStrap class

By injecting the jobLauncher into a BootStrap class along with the job, the job can be executed. Because the launcher is asynchronous, the init closure polls until the job has finished.
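import org.springframework.batch.core.JobParameters
import org.springframework.batch.core.JobParametersBuilder

class MyDomainBootStrap {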

  def jobLauncher
  def myDomainDataJob

  def init = { servletContext ->
    JobParameters jobParameters = new JobParametersBuilder().toJobParameters()
    def jobEx = jobLauncher.run(myDomainDataJob, jobParameters)
    while(jobEx.isRunning()){
      Thread.sleep(5000)
    }
  }

  def destroy = { }
}
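If the polling loop is not wanted, one variation raised in the comments is to leave the taskExecutor unset on the launcher. To the best of my knowledge SimpleJobLauncher then falls back to a synchronous executor, so run() only returns once the job has ended. The sketch below reuses the bean names from this post; MySyncBootStrap is a hypothetical class name, and this is not the configuration behind the timing reported in this post.

```groovy
import org.springframework.batch.core.JobParametersBuilder

// In resources.groovy the launcher bean would be declared without a taskExecutor:
//   jobLauncher(org.springframework.batch.core.launch.support.SimpleJobLauncher){
//     jobRepository = ref("jobRepository")
//   }

// grails-app/conf/MySyncBootStrap.groovy (hypothetical name)
class MySyncBootStrap {

  def jobLauncher
  def myDomainDataJob

  def init = { servletContext ->
    // With a synchronous launcher this call blocks until the job has ended,
    // so no while/sleep loop is needed.
    def jobEx = jobLauncher.run(myDomainDataJob, new JobParametersBuilder().toJobParameters())
    println "Job ended with status ${jobEx.status}"
  }

  def destroy = { }
}
```

The step can still keep its own SimpleAsyncTaskExecutor, since that setting is independent of how the job itself is launched.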
Inserting around 500,000 records using this method took around 5 minutes. I hope this helps you to BootStrap your apps with large data files or just start using Spring Batch in your Grails application.

6 comments:

  1. I have started using Spring Batch for similar reasons - it is very handy to leave all this stuff related with commits, reading huge flat files or even xml files to same tool.

    If you use simple synchronous TaskExecutor instead of asynchronous, you don't need "while ..." piece of code.

  2. Hey Oliver,
    Did you by any chance have a description of the directories you put all these files into ? perhaps a code dump that we can use ?
    thanks

  3. Hey Jamo,

    Sorry for the slow response I've been tied up with work for the past week.

    I put the bean definitions in a separate resource file under grails-app/conf/spring called bootstrapMyDomain.groovy. This included jobRepository, jobLauncher, myDomainItemReader, myDomainClassDataFile, myDomainLineTokenizer, myDomainItemWriter, myDomainDataStep and myDomainDataJob.

    This file was included in my application context by adding loadBeans('grails-app/conf/spring/bootstrapMyDomain.groovy') in the resources.groovy file.

    I put the MyDomainClassMapper and MyDomainItemWriter in the src/groovy directory and the MyDomainBootStrap in grails-app/conf directory.

    The file containing my data was in /grails-app/conf/resources/ .

    Is that what you were looking for?

  4. Ollie,

    I was hoping you would actually zip some of these files and put them somewhere for a download (after you sanitize of course). I will certainly try and follow your steps with the additional information and try to replicate what you did. It is amazing how little is written out there about how to bootstrap grails using the non conventional methods like direct SQL or spring batch.

    I appreciate your responses...

  5. Hi Jamo,

    Looks like there was a couple of changes in Grails 1.1 that broke the example. I've uploaded a sample project here http://sites.google.com/site/neuralmonkeystatic/Home/bootstrapfrombatch.zip . The data file is in the bootstrapdata folder and I had to add Events.groovy eventCompileEnd closure to copy across the extra spring bean dsl.

    Let me know how you get on with it.

  6. Oliver,
    Thanks very much for the project.. it was extremely useful and worked nicely. I have to make some adaptations to it but i really appreciate getting the source!!

    thanks
