OurSQL Episode 167: Data, Data Everywhere

This week we discuss test data sets. Ear Candy is errata from Episode 165 on Galera Cluster, and At the Movies is "Runaway Complexity in Big Data".

Test Data Sets
Episode 81, where we talked about different benchmark tools
Sysbench - general version
Percona's sysbench tests

Factors: # tables, concurrency, data set size

Episode 143, where we did an ear candy featuring Morgan Tocker's article on estimating MySQL's working set with INFORMATION_SCHEMA

tpc.org, the Transaction Processing Performance Council's website

TPC-C, the standard benchmarking suite
tpcc-mysql from Percona
Article on Percona's website: tpcc-mysql simple usage steps and how to build graphs with gnuplot.

TPC-DS - for decision support
How to load TPC-DS data using a MySQL server

Already available data sets:
MySQL's sample data sets - usually too small for load tests.

Mozilla's Bugzilla database

Daniel Lemire's article about publicly available large data sets for database research - make sure to read the comments!

Curated collection of links to publicly available large data sets

Data.gov, data compiled by the US government.
Find a dataset by organization
Airline on-time performance data
Episode 63, where we talked about star schemas and fact tables

LOAD DATA INFILE example:
LOAD DATA INFILE '/path/to/data.csv' INTO TABLE table_name IGNORE 1 LINES;
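
LOAD DATA INFILE splits fields on tabs by default, so a comma-separated file usually needs an explicit FIELDS TERMINATED BY ',' clause alongside IGNORE 1 LINES. Here is a minimal sketch of that in Python: it writes a small CSV with a header row and builds the matching statement. The column names, carrier codes, and helper names (write_test_csv, load_statement) are made up for illustration and did not come from the episode. The path and table name are the same placeholders used above.

```python
import csv
import random


def write_test_csv(path, rows, seed=0):
    """Write a header row plus `rows` rows of made-up data to `path`."""
    rng = random.Random(seed)  # fixed seed so repeated runs produce the same file
    with open(path, "w", newline="") as f:
        # Use "\n" line endings so they match LINES TERMINATED BY '\n' below.
        writer = csv.writer(f, lineterminator="\n")
        # Header row: this is the line that IGNORE 1 LINES skips on load.
        writer.writerow(["id", "carrier", "delay_minutes"])
        for i in range(1, rows + 1):
            writer.writerow([i, rng.choice(["AA", "DL", "UA"]), rng.randint(0, 180)])


def load_statement(path, table):
    """Build a LOAD DATA INFILE statement for a comma-separated file with a header."""
    return (
        f"LOAD DATA INFILE '{path}' INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n' "
        "IGNORE 1 LINES;"
    )


if __name__ == "__main__":
    write_test_csv("data.csv", rows=1000)  # raise `rows` to grow the data set
    print(load_statement("/path/to/data.csv", "table_name"))
```

Without the LOCAL keyword the file is read on the server host, so the path has to be one the MySQL server itself can read.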

pt-log-player
mysqlslap

iibench from Tokutek

Ear Candy
In today's ear candy we have errata from Episode 165 about Galera Cluster: its replication is semi-synchronous by default.
Documentation for wsrep_causal_reads, explaining how to enforce synchronous replication.

At the Movies
This week in At the Movies, we present "Runaway Complexity in Big Data - and a plan to stop it" by Nathan Marz of Twitter. Given at last year's Strange Loop conference, this talk outlines several sources of complexity introduced in data systems (lack of human fault-tolerance, conflation of data and queries, schemas done wrong) and what can be done to avoid them.

Where you can see us
Sheeri will be at the January Boston MySQL Meetup Group on Monday, January 13th.
Gerry will be at the January Seattle MySQL Meetup Group on Monday, January 13th.

Feedback
Facebook group
Google+ page
e-mail: podcast at technocation.org
voicemail using phone/Skype: +1-617-674-2369
twitter: @oursqlcast
or Tweet about @oursqlcast