Sunday 16 October 2011

MySQL Large Table: Split OR Partitioning???

For couple of past months this question has puzzled me at my work. And seeking answers over the internet, I wasn’t able to find a specific answer to the question. So in this post I would only like to highlight my experiences so far with my decision, with no bias to any particular method.

Table Split Vs Partitioning, this decision should be primarily based on the context of usage pattern of the database and type of queries being executed on the database on regular basis/users of the database.


When to split table into smaller tables:
• If the queried table is being scanned on non regular columns (i.e. the queries “Where” clause always changes to different columns within the table)
• If the queries are analytical in nature and direct users of the database are business users.
• If the partition mechanism has to span more than 1024 partition (MySQL limitation)

The disadvantage of splitting the table into multiple tables, highlight problems relevant to querying the database upon multiple tables (with usage of dynamic SQL within stored procedures), complex logic, creation of large number of tables and further more. But these problems outweigh the benefits achieved for analytical purposes once the system is set, keeping it in simple terms with each query upon spliced tables has fewer rows to scan physically and hence forth the results are faster with union all’ed result presented and consistent across any column scan involved in the query.

When to partition a table:
• If the queries are mostly regular in nature or database acts as a backend to the business system (i.e. the majority of queries “Where” clause is using the same column for scan within the table).
• The use of database if limited to storing of records and retrieval of records on standard parameters (i.e. non analytical purposes).
• Where database is being utilized by ORM mechanisms like ADO.NET/Hibernate.
• Foreign keys are not supported on partitioned table.

The disadvantage of partitioned table within an analytical environment is some times more detrimental in terms of performance than the advantages it results into. This is due to the fact when the column scans is performed on the partitioned table upon which the table is not partitioned is employs mysql more effort to scan the each partition of the table for the results and query execution is slower than the table split. But also to mention in spite of the partitioning mechanism used one should also take care of the mechanism of “Partition Pruning” related to the where clause in the select queries illustrating the mysql which partitions to scan for the result.

Performance Results:

In the experiment table contains 28,44,042 rows with "from_date" being indexed:
Note: all the tables in the example are partitioned on the “from_date” column in the table. 


#**Simple Table **
CREATE TABLE `salaries` (
 `emp_no` INT(11) NOT NULL,
 `salary` INT(11) NOT NULL,
 `from_date` DATE NOT NULL,
 `to_date` DATE NOT NULL,
 PRIMARY KEY (`emp_no`, `from_date`),
 INDEX `emp_no` (`emp_no`)
 )
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB
ROW_FORMAT=DEFAULT

#**Partition by month**
CREATE TABLE `salaries_copy` (
  `emp_no` int(11) NOT NULL,
  `salary` int(11) NOT NULL,
  `from_date` date NOT NULL,
  `to_date` date NOT NULL,
  PRIMARY KEY (`emp_no`,`from_date`),
  KEY `emp_no` (`emp_no`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPACT
/*!50100 PARTITION BY HASH (Month(from_date))
PARTITIONS 12 */

#**Partition by Range**
CREATE TABLE `salaries_copy_1` (
  `emp_no` int(11) NOT NULL,
  `salary` int(11) NOT NULL,
  `from_date` date NOT NULL,
  `to_date` date NOT NULL,
  PRIMARY KEY (`emp_no`,`from_date`),
  KEY `emp_no` (`emp_no`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPACT
/*!50100 PARTITION BY RANGE (to_days(from_date))
(PARTITION p0 VALUES LESS THAN (to_days('1985-01-01')) ENGINE = InnoDB,
 PARTITION p1 VALUES LESS THAN (to_days(‘1986-01-01’)) ENGINE = InnoDB,
 PARTITION p2 VALUES LESS THAN (to_days(‘1987-01-01’)) ENGINE = InnoDB,
 PARTITION p3 VALUES LESS THAN (to_days('1988-01-01')) ENGINE = InnoDB,
 PARTITION p4 VALUES LESS THAN (to_days('1989-01-01')) ENGINE = InnoDB,
 PARTITION p5 VALUES LESS THAN (to_days('1990-01-01')) ENGINE = InnoDB,
 PARTITION p6 VALUES LESS THAN (to_days('1991-01-01')) ENGINE = InnoDB,
 PARTITION p7 VALUES LESS THAN (to_days('1992-01-01')) ENGINE = InnoDB,
 PARTITION p8 VALUES LESS THAN (to_days('1993-01-01')) ENGINE = InnoDB,
 PARTITION p9 VALUES LESS THAN (to_days('1994-01-01')) ENGINE = InnoDB,
 PARTITION p10 VALUES LESS THAN (to_days('1995-01-01')) ENGINE = InnoDB,
 PARTITION p11 VALUES LESS THAN (to_days('1996-01-01')) ENGINE = InnoDB,
 PARTITION p12 VALUES LESS THAN (to_days('1997-01-01')) ENGINE = InnoDB,
 PARTITION p13 VALUES LESS THAN (to_days('1998-01-01')) ENGINE = InnoDB,
 PARTITION p14 VALUES LESS THAN (to_days('1999-01-01')) ENGINE = InnoDB,
 PARTITION p15 VALUES LESS THAN (to_days('2000-01-01')) ENGINE = InnoDB,
 PARTITION p16 VALUES LESS THAN (to_days('2001-01-01')) ENGINE = InnoDB,
 PARTITION p17 VALUES LESS THAN (to_days('2002-01-01')) ENGINE = InnoDB,
 PARTITION p18 VALUES LESS THAN (to_days('2003-01-01')) ENGINE = InnoDB,
 PARTITION p19 VALUES LESS THAN (to_days('2004-01-01')) ENGINE = InnoDB,
 PARTITION p20 VALUES LESS THAN (to_days('2005-01-01')) ENGINE = InnoDB,
 PARTITION pmax VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */

#********* Test 1: Queries scanning the partitioned column: from_date ***********
Select SQL_NO_CACHE * From salaries tbl
where tbl.from_date >= '2000-03-15' and tbl.from_date < '2000-09-25';
#Duration for 1 query: 0.016 sec. (+ 2.137 sec. network)

Select SQL_NO_CACHE * From salaries_copy tbl
where tbl.from_date >= '2000-03-15' and tbl.from_date < '2000-09-25';
#Duration for 1 query: 2.106 sec. (+ 5.288 sec. network)

Select SQL_NO_CACHE * From salaries_copy_1 tbl
where tbl.from_date >= '2000-03-15' and tbl.from_date < '2000-09-25';
#Duration for 1 query: 0.063 sec. (+ 1.185 sec. network)

Select SQL_NO_CACHE * From salaries_1985
where salaries_1985.from_date >= '2000-03-15' and salaries_1985.from_date < '2000-09-25'
 UNION ALL 
Select * From salaries_1986
where salaries_1986.from_date >= '2000-03-15' and salaries_1986.from_date < '2000-09-25'
 UNION ALL …
…
Select * From salaries_2005
where salaries_2005.from_date >= '2000-03-15' and salaries_2005.from_date < '2000-09-25';
#Duration for 1 queries: 1.638 sec. (+ 0.484 sec. network)

#********* Test 2: Queries scanning the non partitioned column: to_date ***********
Select SQL_NO_CACHE * From salaries tbl
where tbl.to_date >= '2000-03-15' and tbl.to_date < '2000-09-25';
#Duration for 1 query: 0.109 sec. (+ 2.762 sec. network)

Select SQL_NO_CACHE * From salaries_copy tbl
where tbl.to_date >= '2000-03-15' and tbl.to_date < '2000-09-25';
#Duration for 1 query: 1.201 sec. (+ 6.521 sec. network)

Select SQL_NO_CACHE * From salaries_copy_1 tbl
where tbl.to_date >= '2000-03-15' and tbl.to_date < '2000-09-25';
#Duration for 1 query: 7.472 sec. (+ 3.058 sec. network)

Select SQL_NO_CACHE * From salaries_1985
where salaries_1985.to_date >= '2000-03-15' and salaries_1985.to_date < '2000-09-25'
 UNION ALL 
Select * From salaries_1986
where salaries_1986.to_date >= '2000-03-15' and salaries_1986.to_date < '2000-09-25'
 UNION ALL …
…
Select * From salaries_2005
where salaries_2005.to_date >= '2000-03-15' and salaries_2005.to_date < '2000-09-25';
#Duration for 1 query: 1.670 sec. (+ 0.483 sec. network)

Comparison Table:

Query on “from_date Query on “to_date Description
Indexed column Non-Indexed column -
0.016 sec. (+ 2.137 sec. network) 0.109 sec. (+ 2.762 sec. network) Simple table
2.106 sec. (+ 5.288 sec. network) 1.201 sec. (+ 6.521 sec. network) Partition by HASH (Month(from_date))
0.063 sec. (+ 1.185 sec. network) 7.472 sec. (+ 3.058 sec. network) Partition by RANGE (to_days(from_date))
1.638 sec. (+ 0.484 sec. network) 1.670 sec. (+ 0.483 sec. network) Table Split by year(from_date)


Though I do not wish to end this trail in here and I would like to know reader opinions and thoughts about this topic and shed more light whether the table split is better than partitioning or vice versa.

Look forward to your comments …

References: 
Link1: http://dev.mysql.com/doc/refman/5.1/en/partitioning.html

No comments: