Spring Batch — Import CSV to Database (Complete Guide)

Importing CSV files into a database is one of the most common batch tasks. This guide shows a practical, production-ready approach using Spring Batch: how to read CSVs, validate records, skip faulty rows, write to DB efficiently, and design for restarts and observability.

In short: read with FlatFileItemReader, validate in an ItemProcessor, and write with JdbcBatchItemWriter (or JpaItemWriter). Enable faultTolerant() with skip rules or a custom SkipPolicy, tune the chunk size, and size the DB connection pool to at least the number of concurrent writer threads.

Why this guide is different

  • Focus on real-world pitfalls (validation, duplicates, partial failures)
  • Production advice: retries, skip listeners, idempotency
  • Configuration and tuning: chunk size, DB pool, throughput
  • Complete runnable snippets (drop into Spring Boot)

Maven dependencies

<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-batch</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-jdbc</artifactId>
</dependency>
<dependency>
  <groupId>com.h2database</groupId>
  <artifactId>h2</artifactId>
  <scope>runtime</scope>
</dependency>

application.properties (example)

spring.datasource.url=jdbc:h2:mem:testdb
spring.datasource.username=sa
spring.datasource.password=
spring.h2.console.enabled=true

# Spring Batch
spring.batch.job.enabled=false

# Hikari tuning (example)
spring.datasource.hikari.maximum-pool-size=30

Sample CSV (employees.csv)

id,name,email,department,salary
1,John Doe,john@example.com,Engineering,72000
2,Jane Smith,jane@example.com,HR,50000
3,Broken,invalid-email,Sales,45000
4,Sam Blue,sam@example.com,Engineering,80000

Entity (POJO)

import java.math.BigDecimal;

public class Employee {
  private Long id;
  private String name;
  private String email;
  private String department;
  private BigDecimal salary;
  // getters & setters omitted for brevity
}
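Before wiring up Spring Batch, it can help to see the mapping in plain Java. This is a minimal sketch of what the reader's field mapping does for one data line (a naive String.split; the real FlatFileItemReader below also handles quoting and conversion errors):

```java
import java.math.BigDecimal;

// Minimal sketch of mapping one CSV data line to the Employee fields.
public class CsvLineSketch {
  public static Object[] parse(String line) {
    String[] f = line.split(",");
    return new Object[] {
      Long.parseLong(f[0]),   // id
      f[1],                   // name
      f[2],                   // email
      f[3],                   // department
      new BigDecimal(f[4])    // salary
    };
  }

  public static void main(String[] args) {
    Object[] fields = parse("1,John Doe,john@example.com,Engineering,72000");
    System.out.println(fields[1] + " earns " + fields[4]);
  }
}
```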

FlatFileItemReader — robust setup

Use FlatFileItemReader with a LineMapper (tokenizer + field-set mapper). Make the bean @StepScope so job parameters, such as the input file path, can be injected with late binding.

@Bean
@StepScope
public FlatFileItemReader<Employee> reader(@Value("#{jobParameters['file']}") String file) {
  return new FlatFileItemReaderBuilder<Employee>()
    .name("employeeReader")
    .resource(new FileSystemResource(file))
    .linesToSkip(1)
    .delimited()
    .names("id","name","email","department","salary")
    .fieldSetMapper(fieldSet -> {
      Employee e = new Employee();
      e.setId(fieldSet.readLong("id"));
      e.setName(fieldSet.readString("name"));
      e.setEmail(fieldSet.readString("email"));
      e.setDepartment(fieldSet.readString("department"));
      e.setSalary(fieldSet.readBigDecimal("salary"));
      return e;
    })
    .build();
}

Validation in ItemProcessor

Validate business rules in the processor. Throw an exception for invalid data so SkipPolicy or skip configuration can handle it.

public class EmployeeProcessor implements ItemProcessor<Employee, Employee> {

  @Override
  public Employee process(Employee emp) throws Exception {
    if (emp.getEmail() == null || !emp.getEmail().contains("@")) {
      throw new IllegalArgumentException("Invalid email: " + emp.getEmail());
    }
    if (emp.getSalary() == null || emp.getSalary().signum() < 0) {
      throw new IllegalArgumentException("Invalid salary: " + emp.getSalary());
    }
    // trim / normalize (null-safe: name can be missing in malformed rows)
    if (emp.getName() != null) {
      emp.setName(emp.getName().trim());
    }
    return emp;
  }
}
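The two rules in the processor can be exercised outside Spring. Below is a standalone version of the same checks as a hypothetical helper (not part of the job itself), handy for fast unit tests:

```java
import java.math.BigDecimal;

// Standalone version of the processor's validation rules, for unit testing.
public class EmployeeValidation {

  // Deliberately minimal, matching the processor: "contains @" only.
  public static boolean isValidEmail(String email) {
    return email != null && email.contains("@");
  }

  // Salary must be present and non-negative.
  public static boolean isValidSalary(BigDecimal salary) {
    return salary != null && salary.signum() >= 0;
  }
}
```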

JdbcBatchItemWriter — efficient batch writes

JdbcBatchItemWriter is fast for bulk inserts. Use parameterized SQL and map fields from the item.

@Bean
public JdbcBatchItemWriter<Employee> writer(DataSource ds) {
  return new JdbcBatchItemWriterBuilder<Employee>()
    .dataSource(ds)
    .sql("INSERT INTO employee (id, name, email, department, salary) VALUES (:id, :name, :email, :department, :salary)")
    .beanMapped()
    .build();
}

Job configuration — chunk + fault tolerance

Use chunk-oriented step, tune chunk size (commit interval), and enable fault tolerance with skips. Also attach a SkipListener to log & audit skipped rows.

@Bean
public Job importJob(JobRepository jobRepo, Step importStep) {
  return new JobBuilder("importJob", jobRepo)
    .start(importStep)
    .build();
}

@Bean
public Step importStep(JobRepository jobRepo, PlatformTransactionManager txManager,
                       FlatFileItemReader<Employee> reader,
                       EmployeeProcessor processor,
                       JdbcBatchItemWriter<Employee> writer) {

  return new StepBuilder("importStep", jobRepo)
    .<Employee, Employee>chunk(100, txManager)
    .reader(reader)
    .processor(processor)
    .writer(writer)
    .faultTolerant()
      .skip(IllegalArgumentException.class)
      .skipLimit(50)
      .listener(new LoggingSkipListener())
    .build();
}

SkipListener — audit skipped records

public class LoggingSkipListener implements SkipListener<Employee, Employee> {
  @Override
  public void onSkipInRead(Throwable t) { /* log read errors */ }
  @Override
  public void onSkipInWrite(Employee item, Throwable t) { /* log write errors */ }
  @Override
  public void onSkipInProcess(Employee item, Throwable t) {
    // persist to error table or write to failed CSV for reprocessing
    System.err.println("Skipped: " + item + " because " + t.getMessage());
  }
}

Transactional behavior & restartability

  • Each chunk is committed in its own transaction; if a write fails, the whole chunk rolls back (with faultTolerant(), Spring Batch then re-processes the chunk's items one by one to isolate the bad record).
  • To support job restarts, rely on Spring Batch metadata (JobRepository / execution context).
  • Avoid in-processor state that cannot be reconstructed on restart (use StepExecutionContext if needed).

Idempotency: make writes safe

If a write could be attempted multiple times (retries, restarts), ensure idempotency. Use DB upserts or a unique constraint + insert-if-not-exists logic.

-- Example Postgres upsert (for JdbcBatchItemWriter)
INSERT INTO employee (id, name, email, department, salary)
VALUES (:id, :name, :email, :department, :salary)
ON CONFLICT (id) DO UPDATE SET
  name = EXCLUDED.name,
  email = EXCLUDED.email,
  department = EXCLUDED.department,
  salary = EXCLUDED.salary;
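Since the demo in this guide runs on H2, note that H2 does not use ON CONFLICT; its classic upsert syntax is MERGE with a KEY clause. A sketch of the equivalent statement (adjust to your schema):

```sql
-- H2 equivalent of the Postgres upsert above
MERGE INTO employee (id, name, email, department, salary)
  KEY (id)
  VALUES (:id, :name, :email, :department, :salary);
```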

Performance tuning checklist

  • Chunk size: start with 100, test 50/200/500 — larger reduces commits but increases rollback cost
  • JDBC batch size: ensure your writer batches SQL statements (JdbcBatchItemWriter does by default)
  • DB pool: maxPoolSize >= concurrent writer threads
  • Monitoring: measure job duration, DB latency, connection usage, GC pauses
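To make the chunk-size trade-off concrete: the number of commits is roughly rows / chunkSize, while the work lost when a chunk fails is at most one chunk. A back-of-envelope sketch:

```java
// Back-of-envelope: commits vs. rollback cost for different chunk sizes.
public class ChunkMath {

  // Ceiling division: how many chunk commits a run of `rows` items needs.
  public static long commits(long rows, int chunkSize) {
    return (rows + chunkSize - 1) / chunkSize;
  }

  public static void main(String[] args) {
    long rows = 1_000_000;
    for (int chunk : new int[] {50, 100, 500}) {
      System.out.println("chunk=" + chunk
          + " -> commits=" + commits(rows, chunk)
          + ", max items rolled back on one failure=" + chunk);
    }
  }
}
```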

Observability: instrumenting with Micrometer

Time the processor with a Micrometer Timer to track per-item latency. Fragment only — note the Timer must be created after the registry is injected, not in a field initializer:

@Autowired MeterRegistry registry;
private Timer processTimer;

@PostConstruct
void init() {
  processTimer = registry.timer("batch.employee.process.time");
}

public Employee process(Employee e) {
  return processTimer.record(() -> {
    // validation and enrichment
    return e;
  });
}

Testing & local debugging

# Trigger the job via REST (matches the controller mapping below)
curl -s "http://localhost:8080/jobs/import-csv?file=/tmp/employees.csv"

# Use JobLauncherTestUtils for integration tests
JobExecution exec = jobLauncherTestUtils.launchJob(new JobParametersBuilder()
  .addString("file", "/tmp/test.csv")
  .toJobParameters());
assertEquals(BatchStatus.COMPLETED, exec.getStatus());

Troubleshooting common issues

Symptom                             | Possible cause                            | Fix
Many skipped rows                   | Validation rules too strict or bad CSV    | Log failed rows, relax/adjust validation, share a sample with users
Job slow / DB connections exhausted | Too many concurrent writers vs. pool size | Increase Hikari maximum-pool-size or reduce parallelism
Duplicates after restart            | Writes not idempotent                     | Use upserts or unique constraints plus dedupe logic
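The "dedupe logic" in the last row can be as simple as filtering out already-seen keys within a run. An in-memory sketch keyed on the first CSV field (for true restart safety, prefer a DB unique constraint or upsert):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Keep only the first row per id (in-memory dedupe within a single run).
public class Dedupe {
  public static List<String> dedupeByFirstField(List<String> csvRows) {
    Set<String> seenIds = new HashSet<>();
    List<String> out = new ArrayList<>();
    for (String row : csvRows) {
      String id = row.split(",", 2)[0];
      if (seenIds.add(id)) {  // add() returns false when the id was already seen
        out.add(row);
      }
    }
    return out;
  }
}
```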

Real-world pattern: Failed-items dead-letter table

Persist failed rows into an error table with failure reason and job metadata. This allows analysts to inspect and reprocess only problematic items.

CREATE TABLE employee_failed (
  id BIGINT,
  payload CLOB,
  error_message VARCHAR(1000),
  job_name VARCHAR(200),
  failed_at TIMESTAMP
);

Sample REST controller to trigger job

@RestController
@RequestMapping("/jobs")
public class JobLauncherController {

  @Autowired JobLauncher jobLauncher;
  @Autowired Job importJob;

  @GetMapping("/import-csv")
  public String importCsv(@RequestParam String file) throws Exception {
    JobParameters params = new JobParametersBuilder()
      .addString("file", file)
      .addLong("time", System.currentTimeMillis())
      .toJobParameters();
    jobLauncher.run(importJob, params);
    return "Job launched";
  }
}

FAQ

  • Is FlatFileItemReader thread-safe? No — use StepScope or partitioning to avoid sharing a reader across threads.
  • How many records per second? Depends on DB, network, and chunk size — benchmark! Use Micrometer counters.
  • Should I use JdbcBatchItemWriter or JpaItemWriter? JdbcBatchItemWriter is usually faster for bulk inserts; JPA can be used if you need entity lifecycle features.

Conclusion

This pattern (reader → processor → writer), combined with robust validation, skip-and-audit handling, idempotent writes, and sensible tuning, makes CSV import jobs reliable and maintainable in production.

