Performance troubleshooting case study – Slow batch job in AEM

Problem statement

A batch job in a JVM-based WCMS (AEM) was taking more than 60 minutes to complete. The requirement was to bring it down to 15 minutes.

Context

Adobe Experience Manager (AEM) was used as an eCommerce PIM (Product Information Management) system holding a very large number of SKUs. The slow use case is a manually triggered export of product data: it reads the data from the repository, processes it, and exports the computed output in an easy-to-consume JSON format. The implementation is multi-threaded.

Server details

  • 16 vCPU
  • 32 GB memory
  • CentOS 7
  • JRE 7
  • AEM 6

Diagnostic steps

  1. While the slow use case was running, a series of screenshots of the top command's output was captured from the terminal. They showed the CPU consistently about 80% idle, with user time around 10% most of the time and low I/O wait, so disk I/O was unlikely to be the bottleneck. The inference was that the application was leaving most of the CPU unused.
  2. Because the implementation was already multi-threaded, it should have been able to use all available CPU cores. A series of thread dumps was taken and analyzed with tools such as Samurai and fastthread.io. They showed that, of the 10+ threads serving the use case, only one was running at any point in time. All the others were in the WAITING state, waiting to acquire a re-entrant lock. This is a classic sign of lock contention, so a programming mistake was suspected.
  3. A review of the code behind the use case found a known AEM anti-pattern: sharing a session across threads. A single Oak session instance was shared by all the worker threads, and Apache Jackrabbit Oak enforces locking on the session to prevent repository corruption. Those locks stopped the threads from running in parallel and left the CPU idle. The offending code was fixed and performance improved.
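
The effect described in steps 2 and 3 can be reproduced in isolation. The following is a minimal sketch, not the actual AEM code: StubSession is a hypothetical stand-in for a repository session (it is not the JCR or Oak API), and its internal ReentrantLock only mimics the per-session locking described above. The sketch measures how many reads are ever in progress at the same time when the workers share one session versus when each worker has its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.locks.ReentrantLock;

public class SessionContentionSketch {

    /** Tracks how many reads are in progress at once, across all sessions. */
    static final class Gauge {
        private int active;
        private int max;
        synchronized void enter() { active++; if (active > max) { max = active; } }
        synchronized void exit() { active--; }
        synchronized int max() { return max; }
    }

    /** Hypothetical stand-in for a repository session; NOT the JCR/Oak API. */
    static final class StubSession {
        private final ReentrantLock lock = new ReentrantLock();
        private final Gauge gauge;

        StubSession(Gauge gauge) { this.gauge = gauge; }

        void read() {
            lock.lock(); // assumption: one operation at a time per session instance
            try {
                gauge.enter();
                try {
                    Thread.sleep(10); // simulated repository read
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    gauge.exit();
                }
            } finally {
                lock.unlock();
            }
        }
    }

    /** Runs the workers and returns the highest number of simultaneous reads observed. */
    public static int maxConcurrentReads(final boolean shareSession, int workers,
                                         final int readsPerWorker) {
        final Gauge gauge = new Gauge();
        final StubSession shared = new StubSession(gauge);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<?>> futures = new ArrayList<Future<?>>();
        for (int i = 0; i < workers; i++) {
            futures.add(pool.submit(new Runnable() {
                @Override
                public void run() {
                    // Anti-pattern: every worker uses the same session.
                    // Alternative: each worker creates and uses its own session.
                    StubSession session = shareSession ? shared : new StubSession(gauge);
                    for (int r = 0; r < readsPerWorker; r++) {
                        session.read();
                    }
                }
            }));
        }
        try {
            for (Future<?> f : futures) {
                f.get(); // wait for every worker to finish
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return gauge.max();
    }

    public static void main(String[] args) {
        // With one shared session the lock serializes all reads, so this is 1.
        System.out.println("shared session:     " + maxConcurrentReads(true, 8, 5));
        // With one session per worker, reads can overlap (at most 8 here).
        System.out.println("session per worker: " + maxConcurrentReads(false, 8, 5));
    }
}
```

In the shared case the worker threads queue on the session's lock, which matches the thread dumps showing one running thread and the rest in WAITING. The case study does not say how the code was actually fixed. A common approach in AEM is to have each worker open its own session or resource resolver and close it when done, for example via Sling's ResourceResolverFactory, but that is an assumption rather than something stated in the original.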
