Sunday, 22 June 2014

Java Code Review Checklist


There exist many code review checklist blogs/articles in the net telling standard Java best practices/guidelines; And I have come up with a checklist accommodating best practices from some of the recent popular books (reference given) and guidelines from Oracle (on Security):

Please navigate to following link to see my post:
http://java.dzone.com/articles/java-code-review-checklist

Sunday, 1 June 2014

Postgres Performance Improvement Tips for Large Data Set Tables

Problem Statement

Slow running queries against a large data set tables (>5 GB)

Areas You May Have Already Looked into 

  • SQL Query Tuning 
  • and JDBC Code Optimization

Areas to Explore More

  • DB Design
    • Disk I/O
      • Number of Columns in a Table
      • Choose right Data Types for Columns
    • Tune DB Buffer Size
      • To improve Caching
    • Standard recommended solution to deal with large data set tables
      • Table Partitioning 
        • Since at a certain point you realize the data size in the database is the bottleneck
  • Application Design
    • Tune JDBC Code and Design
    • Apply Application Level Cache

How Postgres Stores Data

In order to understand the performance improvements suggestion, first we need to understand some of the basic architecture of PostgreSQL:

  • Disk Files
    • All table, indexes are stored in OS files
    • Files can find under: <Postgres home>/pgdata/base/
  • Page Size
    • The size of a page is fixed at 8,192 bytes (8K)
      • you can increase or decrease the page size if you build your own copy of PostgreSQL from source, but all pages within a database will be the same size
    • All disk I/O is performed on a page-by-page basis, when you select a single row from a table, PostgreSQL will read at least one page
    • Heap page & Index Page
      • Two different pages one for actual table data and another for index data
  • Heap and Index Cache Hits
    • Disk I/O is expensive
    • Caches heap and index pages
      • Postgres itself tracks access patterns of your data and will on its own keep frequently accessed data in cache
      • Memory access is fast; disk access is slow. 
    • Insertion order matters for effective caching

 Minimize Disk I/O

  • Normalize Database
    • Remove unnecessary columns
  • Choose Right Data Types
    • Avoid using larger size data type if data values are small and can fit into smaller data types
    • It has a direct impact on Cache hits

Tune DB Buffer Size

  • Adjust the DB Buffer Size
    • To improve heap block and index block cache hits
    • You could increase the cache size to be large enough to hold the entire table, but that's a large shared memory segment, and if you don't have enough physical memory, your system will start to thrash

Dynamic Table Partition For Large Data Set Table

  • Table partition standard solution for large data set tables
  • The average number of heap/index blocks you'll have to navigate in order to find a row goes down
  • Partition also benefits on choosing the right scan types during query
  • There are some maintenance advantages too. You can DROP an individual partition, to erase all of the data from that range. This is a common technique for pruning historical data out of a partitioned table, one that avoids the VACUUM cleanup work that DELETE leaves behind.
  • Dynamic partition rules can be setup which minimizes maintenance overhead and transparent to application layer
  • Tips:
    • On what column to partition matters
    • No. of partitions should not be large ( < 100)
    • Race condition if two separate transactions inserts, may get two child partition tables with the same name if not taken care properly

Thank You!