This week we discuss test data sets. Ear Candy is errata from Episode 165 on Galera Cluster, and At the Movies is "Runaway Complexity in Big Data".
Test Data Sets
Episode 81, where we talked about different benchmark tools
Sysbench - general version
Percona's sysbench tests
Factors: # tables, concurrency, data set size
Episode 143, where we did an ear candy featuring Morgan Tocker's article on estimating MySQL's working set with INFORMATION_SCHEMA
tpc.org, the Transaction Processing Performance Council's website
TPC-C, the standard benchmarking suite
tpcc-mysql from Percona
Article on Percona's website covering simple tpcc-mysql usage steps and how to build graphs with gnuplot.
Already available data sets:
MySQL's sample data sets - usually too small for load tests.
Daniel Lemire's article about publicly available large data sets for database research - make sure to read the comments!
Data.gov, data compiled by the US government.
Find a dataset by organization
Airline on-time performance data
Episode 63, where we talked about star schemas and fact tables
LOAD DATA INFILE example:
LOAD DATA INFILE '/path/to/data.csv' INTO TABLE table_name FIELDS TERMINATED BY ',' IGNORE 1 LINES;
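When none of the public data sets fit, a synthetic file can be generated and loaded the same way. Below is a minimal Python sketch of that idea, not a definitive generator. The column layout, the carrier codes, and the function name write_test_csv are all hypothetical, chosen only for illustration. It writes a header line (the one that IGNORE 1 LINES skips) followed by comma-separated rows, so the load statement would need FIELDS TERMINATED BY ',' because LOAD DATA INFILE expects tab-separated fields by default.

```python
import csv
import random

# Hypothetical column layout and values, chosen only for illustration.
COLUMNS = ["id", "carrier", "dep_delay_minutes"]
CARRIERS = ["AA", "DL", "UA", "WN"]


def write_test_csv(path, n_rows, seed=0):
    """Write one header line plus n_rows synthetic rows; return n_rows."""
    rng = random.Random(seed)  # fixed seed, so repeated runs give the same file
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(COLUMNS)  # header line, skipped by IGNORE 1 LINES
        for i in range(1, n_rows + 1):
            writer.writerow([i, rng.choice(CARRIERS), rng.randint(-10, 180)])
    return n_rows
```

Calling write_test_csv("/path/to/data.csv", 1000000) would produce a million-row file. Raising n_rows is the simplest way to vary data set size between benchmark runs, and the fixed seed keeps those runs comparable.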
In today's Ear Candy we have errata from Episode 165 about Galera Cluster: by default, it uses semi-synchronous replication.
Documentation for wsrep_causal_reads, which explains how to enforce synchronous replication.
At the Movies
This week in At the Movies, we present Runaway Complexity in Big Data - and a plan to stop it, by Nathan Marz of Twitter. Given at last year's Strange Loop conference, this talk outlines several sources of complexity introduced in data systems - lack of human fault-tolerance, conflation of data and queries, schemas done wrong - and what can be done to avoid them.