This documentation intends to guide you through the process of transferring the complete FreeDB-database (a tremendous large bunch of tiny xmcd-textfiles) to a MySQL-database using freedbimporter.jar - many steps will just be explained in a click-here-click-there-manner. Along this some questions may arise and some things might be worth mentioning - some of these are shifted into the appendix of this document (the internals-section).

If you want to set up a MySQL-port from FreeDB you have to be patient. A complete transfer (including a final creation of MySQL-indices) might take 16 to 20 hours.

Prerequisites

You have downloaded the latest FreeDB-database (a 700-MB-tar-file) from http://ftp.freedb.org/pub/freedb/ and unpacked all of the about 3 million tiny textfiles to C:\install\freedb. Your system has Java 1.5 (or higher) installed. You have downloaded freedbimporter.jar from http://sourceforge.net/projects/freedbimporter/files/. You have a MySQL-server 5.0 running on localhost (port 3306) and there is the superuser named root without any password required. To avoid OutOfMemoryErrors you start freedbimporter.jar via commandline with something like this:

java -Xms256m -Xmx512m -jar freedbimporter.jar

If you face sever problems following the instructions here please take a look at the debug-section below. For recent (February 2012) NullPointerExceptions consider advise to change charset-detector.

Merging textfiles from FreeDB

First you reduce the number of about three million tiny textfiles in C:\install\freedb by melting or merging them into only few (about 1500) bigger textfiles in C:\install\freedb.merged. This merge itself takes time (about four hours) but it speeds up subsequent building the MySQL-database and enforces UTF-8-encoding for all merged files.

To start merging you pick first button Merge FreeDB-textfiles from the four-sliced MainGUI

MainGUI first selected

and the MergerConfigurator opens.

MergerConfigurator

This initial merging is crucial for a successful transfer from FreeDB's database (xmcd-metadata stored in textfiles) to MySQL. The xmcd-text might be encoded as US-ASCII, ISO-8859-1 (aka LATIN1) or UTF-8.

For merging you can either stick to ISO-8859-1 or pick UTF-8 to be standard-encoding and assume that all files provided by FreeDB are encoded this way. As this is obviously wrong you can choose an additional filter to check out different encodings for each and every textfile to merge. This package contains IBM's ICU and Mozilla's Chardet to detect files' encodings. You can use them both in an intersection mode that only picks a different encoding for a read file if ICU and Chardet spot it simultaneously. At the same time you can switch off any of these both and assume that your chosen standard-encoding fits all files - doing so is a bad idea.

Note: There seems to be a problem with ICU's charset-detection since the beginning of 2012. You are strongly advised to switch the charset-detector to Chardet if you perform merging on the update-sections of FreeDB-Data.

After you picked source- and target-directory the merger is ready to go

MergerConfigurator ready to go

Now you Start merge. After the merging succeeded (which takes three to four hours) you receive a report and find about 600 textfiles in C:\install\freedb.merged - this merged dataset is completely encoded as UTF-8.

Transferring textfiles to MySQL-Database

The transfer or spooling of FreeDB-data to MySQL-server distinguishes two different situations. Building a MySQL-port from scratch is very time-consuming especially if you decide to index the data for speed and to validate it for consistency. Continuing an existing MySQL-database with all its indices will take even longer. Presumably you just do one initial setup for the MySQL-database, build the indices and validate it while any later monthly release of the FreeDB-database continues the existing port be very patient doing so!

For both scenarios let's assume that your MySQL-server is a freshly installed default-system running on localhost (IP 127.0.0.1), listening on port 3306 with the superuser named root and without any password for this superuser.

Moreover let's assume that you already merged the original dataset from FreeDB so that all textfiles to transfer are already UFT-8-encoded.

Building a new MySQL-database

First step: Creating target-database on the MySQL-server

To create a new MySQL-database you pick second button Manage MySQL-Database from the four-sliced MainGUI

MainGUI second selected

and the DatabaseConfigurator opens.

DatabaseConfigurator default start

You need to connect to your MySQL-server and click Connect. The MySQLConfigurator shows up:

MySQLConfigurator default start

For this connection that just issues the initial SQL-CREATE-sequence the last three options Adapter, Genre-Blacklist and Genre-Whitelist are irrelevant so they will be discussed later.

The only interesting option is Tables. This setting defines the number of tables in the database and should be set either to 2 large tables or to 4 smaller tables. Take a look at the internals below to figure out whether to create two or four tables. If you do not want to get into details and just start investigating FreeDB you select the two-tabled version. The only important thing here: the server you are going to connect does not already have a database named freedb.

Connect to get back to DatabaseConfigurator.

DatabaseConfigurator ready to create database

To create the target-database freedb just press Go!. If there has been no database named freedb on localhost you instantly can confirm success:

MySQLConnector database create ok

By closing the DatabaseConfigurator's window in the upper right corner you retain the root-connection to MySQL-server opened. You can use it later after filling the freshly created database to validate it (an option) and to index it (a recommendation).

Second step: Filling a MySQL-database

To fill a MySQL-database you pick third button Spool textfiles to MySQL-Database from the four-sliced MainGUI

MainGUI third selected

and the SpoolConfigurator shows up:

SpoolConfigurator default start

First you configure the MySQL-Connection again.

MySQLConfigurator default start

The settings for the MySQL-Connection are roughly the same they were when creating the database. As the database freedb now exists you can stick to option Tables' default unset (autodetect).

In this situation there are the three options Adapter, Genre-Blacklist and Genre-Whitelist to consider.

Filling the MySQL-database is often referred to by spooling or transferring - these three terms all denote the same procedure: reading xmcd-textfiles from FreeDB (handles different encodings, might fail and loose some data or mistake encoding and produce garbage), parsing those files (coping with xmcd-violations, might loose some data) and generating some SQL-statements that use the parsed and partially postprocessed data. Just as the both first steps reading and parsing this second step (sending INSERT-statements) looses data due to the content of the data or some of its characters that the MySQL-server might not be able to store in its database.

One consequence is that neither the count of tiny textfiles received from FreeDB nor the number of CDs parsed from merged data determines the number of CDs that will be stored in your MySQL-database.

An intermediate problem is that many CD-descriptions are either rubbish or they contain mistyped strings (names, titles or genres). Most of the postprocessing takes place as successful parsed CDs get spooled to MySQL so this processing is introduced and configured here.

A MySQL-connection takes a CD and adapts it before the connector tries to generate INSERT-statements and sends them to MySQL in order to store the CD that connector just received. Any MySQL-connector can have one of the following adapters:

CDAdapter   does not change any data received and does not try to determine the CD's language.
AnnotatingAdapter   does not change data but adds a language-code, if the CD's language is recognized.
ValidatingAdapter   tries to eliminate mistyped artists, genres or titles and filters out some rubbish. This adapter does not try to determine CD's language.
ValidatingAnnotatingAdapter   tries to eliminate mistyped artists, genres or titles and filters out some rubbish. This adapter tries to determine CD's language. It also can use a black- or a whitelist to reject or accept CDs depending on their genre.

Determining a CD's language is a rather time-consuming task that takes almost two hours to check for the European languages German, English, French, Italian, Portuguese or Spanish.

Validating CD's data means either rejecting strings for being too short or a given genre is blacklisted or - quite the contrary - not whitelisted. You can use a black- or a whitelist to drop some data before it even gets spooled to MySQL. Here is a longish whitelist that seems to be a useful filter. If you like you can combine a black- and a whitelist for more sophisticated filtering.

Click Connect to return to SpoolConfigurator. After you picked your source-directory C:\install\freedb.merged (containing only UTF-8-encoded textfiles) you are ready to start.

SpoolConfigurator ready to go

Start spooling and be patient. Four to five hours will elapse before you will receive a report that does not give an accurate count of CDs in your database. While the CDs in the database themselves are valid there are some holes in the interior numbering-sequences of the database. These holes are harmless but they yield an incorrect count of CDs stored. To correct that you validate your database before creating indices.

By closing the SpoolConfigurator's window in the upper right corner you retain the root-connection to MySQL-server opened. Think about closing this connection first.

Third step: Validating database built and create indices

Now your database contains the FreeDB-data. But there are some holes in the interior numbering-sequences of the database and the data has no indices. So you turn to Manage MySQL-Database again.

MainGUI second selected

Prepare to validate the database (an option, it takes about two hours again) and create indices (a recommendation, it takes four to six hours and causes massive harddisc-activity, your MySQL-server might be nearly gone meanwhile).

DatabaseConfigurator ready to validate and create indices

Instead of a detailed report you receive a brief notice when MySQL finishes.

MySQLConnector validate and create indices ok

Continuing an existing MySQL-database

Continuing an existing MySQL-database is roughly the same procedure you took to build the database. You download the latest tar-ball and extract it to C:\install\freedb and skip the initial database-creation. You could even skip merging.

Merging is a very good idea if you expect to have more then just a single pass through the source-directory. Once you have found an appropriate configuration for your MySQL-database (two- or four-tabled) and you know how to set up the spooler's connection to the MySQL-server it might be sufficient to configure the spooler itself to look for the text-files' encodings while it traverses the unmerged source-directory C:\install\freedb.

The only important difference to the initial setup of a FreeDB-port is the end of the transfer-procedure: If the indices for that database already have been created and existed before the latest spooling then you do not need to (re-) create them as they have been maintained while transferring new CDs to your MySQL-database. The only thing you are recommended to do is to validate the database even after you just spooled few new CDs.

Internals

Links

  1. FreeDB
  2. freedbimporter.jar
  3. xmcd
  4. Mozilla's JChardet
  5. IBM's ICU
  6. whitelist for a validating Adapter

Encoding issues

Encoding is problematic at source (the textfiles) and at destination (the MySQL-database) of spooling. I did not examine these tons of text-files from FreeDB and just rely on Chardet / ICU to detect appropriate encodings. David Alimian was not too confident that Chardet or ICU might find accurate encodings.

The second thing is the encoding of the MySQL-databases. As there are two different schemata there also are two different encodings. The two-tabled-version of the database uses a LATIN2-encoding whereas the four-tabled-version uses UTF8-encoding. The later seems to be very piggy about the CDs delivered and does not store the same amount of CDs as its two-tabled stepbrother.

I have not investigated this and can not tell whether the higher count of CDs accepted by the LATIN2-encoded database is due to its superior encoding or if it just accepts much more garbage than the UTF8-encoded version.

Two schemata of MySQL-databases

There are two different database-schemata. First uses two tables and is referred to as freedblarge for both tables (discs and tracks) are not too normalized and contain more textual information than the same tables within the second schema that uses four tables and is referred to as freedbsmall. This small variant externalizes artists in a separate table and lists genres in a fourth table. The two-tabled (large) version of the database is LATIN2-encoded whereas the four-tables (small) version is UTF8-encoded.

The storage-space these different database-schemata require is just the opposite of what the names suggest: An unfiltered large FreeDB-port stores about 2 million CDs within two tables and uses about 3,1 GB on harddisc - that's a ratio harddisc to CDs of 1,55 for a large schema. An unfiltered small FreeDB-port stores about 1,6 million CDs within four tables and occupies 2,5 GB - that's a ratio harddisc to CDs of 1,56 for a small schema.

Two-tabled freedblarge

The following CREATE describes the two-tabled database-schema used by freedbimporter.jar to store FreeDB's data. This database freedblarge handels all chars LANTIN2-encoded. The indices you can create within third pass are appended here.

CREATE DATABASE IF NOT EXISTS freedblarge DEFAULT CHARACTER SET LATIN2;
DROP TABLE IF EXISTS freedblarge.discs;
CREATE TABLE IF NOT EXISTS freedblarge.discs (
  did BIGINT(10) UNSIGNED NOT NULL,
  freedbdiscid BIGINT(10) UNSIGNED NOT NULL,
  aname VARCHAR(203) NOT NULL,
  dtitle VARCHAR(187) NOT NULL,
  gtitle VARCHAR(106) NOT NULL,
  dreleased YEAR(4) DEFAULT NULL,
  dtracks INT(4) UNSIGNED NOT NULL,
  dseconds INT(4) UNSIGNED NOT NULL,
  dlang ENUM ('deu', 'fra', 'ita', 'por', 'spa', 'eng') DEFAULT NULL,
  UNIQUE(aname, dtitle, dtracks),
  PRIMARY KEY(did)
) ENGINE = MyISAM DEFAULT CHARACTER SET LATIN2;
DROP TABLE IF EXISTS freedblarge.tracks;
CREATE TABLE IF NOT EXISTS freedblarge.tracks (
  discs_did BIGINT(10) UNSIGNED NOT NULL,
  tnumber INT(4) UNSIGNED NOT NULL,
  ttitle VARCHAR(255) NOT NULL,
  aname VARCHAR(203) NOT NULL,
  tseconds INT(4) UNSIGNED NOT NULL,
  PRIMARY KEY(discs_did, tnumber)
) ENGINE = MyISAM DEFAULT CHARACTER SET LATIN2;

CREATE INDEX discs_index ON freedblarge.discs (
  freedbdiscid, aname, dtitle, gtitle, dreleased, dseconds
);
CREATE INDEX tracks_index ON freedblarge.tracks (
  ttitle, aname
);

Four-tabled freedbsmall

The following CREATE describes the four-tabled database-schema used by freedbimporter.jar to store FreeDB's data. This database freedbsmall handels all chars UTF8-encoded. The indices you can create within third pass are appended.
CREATE DATABASE IF NOT EXISTS freedbsmall DEFAULT CHARACTER SET UTF8;
DROP TABLE IF EXISTS freedbsmall.artists;
CREATE TABLE IF NOT EXISTS freedbsmall.artists (
  aid BIGINT(10) UNSIGNED NOT NULL,
  aname VARCHAR(255) NOT NULL,
  PRIMARY KEY(aid),
  UNIQUE(aname)
) ENGINE = MyISAM DEFAULT CHARACTER SET UTF8;
DROP TABLE IF EXISTS freedbsmall.genres;
CREATE TABLE IF NOT EXISTS freedbsmall.genres (
  gid BIGINT(10) UNSIGNED NOT NULL,
  gtitle VARCHAR(255) NOT NULL,
  PRIMARY KEY(gid),
  UNIQUE(gtitle)
) ENGINE = MyISAM DEFAULT CHARACTER SET UTF8;
DROP TABLE IF EXISTS freedbsmall.discs;
CREATE TABLE IF NOT EXISTS freedbsmall.discs (
  did BIGINT(10) UNSIGNED NOT NULL,
  freedbdiscid BIGINT(10) UNSIGNED NOT NULL,
  artists_aid BIGINT(10) UNSIGNED NOT NULL,
  dtitle VARCHAR(241) NOT NULL,
  genres_gid BIGINT(10) UNSIGNED NOT NULL,
  dreleased YEAR(4) DEFAULT NULL,
  dtracks INT(4) UNSIGNED NOT NULL,
  dseconds INT(4) UNSIGNED NOT NULL,
  dlang ENUM ('deu', 'fra', 'ita', 'por', 'spa', 'eng') DEFAULT NULL,
  UNIQUE(artists_aid, dtitle, dtracks),
  PRIMARY KEY(did)
) ENGINE = MyISAM DEFAULT CHARACTER SET UTF8;
DROP TABLE IF EXISTS freedbsmall.tracks;
CREATE TABLE IF NOT EXISTS freedbsmall.tracks (
  discs_did BIGINT(10) UNSIGNED NOT NULL,
  tnumber INT(4) UNSIGNED NOT NULL,
  artists_aid BIGINT(10) UNSIGNED NOT NULL,
  ttitle VARCHAR(255) NOT NULL,
  tseconds INT(4) UNSIGNED NOT NULL,
  PRIMARY KEY(discs_did, tnumber)
) ENGINE = MyISAM DEFAULT CHARACTER SET UTF8;

CREATE INDEX artists_index ON freedbsmall.artists (
  aname
);
CREATE INDEX genres_index ON freedbsmall.genres (
  gtitle
);
CREATE INDEX discs_index ON freedbsmall.discs (
  freedbdiscid, artists_aid, dtitle, genres_gid, dreleased, dseconds
);
CREATE INDEX tracks_index ON freedbsmall.tracks (
  ttitle
);

Debugging

freedbimporter has been started eleven years ago. Development came to rest about two years ago. The few mails that arrive indicate that the most important problem using freedbimporter is the very poor information one gets if interaction with the mysql-server fails.

First step to gain some more information is to download the logging-property-file log4j.properties and put it into the same directory that contains freedbimporter.jar. You open a commandline there and issue

java -Xms512m -Xmx768m -jar freedbimporter.jar --log4jProperties log4j.properties

From now on freedbimporter logs its interaction with the mysql-server to a file named freedbimporter.log. This log should reveal what is going wrong. Be careful repeating your steps: The log grows big very fast and soon becomes much too large to send it to me ;-)

TODOs

Some things have been left to do from time immemorial. To my mind these are:

at least since   issue  
July 2015   Dropped AUTOINCREMENT and made Spooler count transferred discs, artists, genres and tracks itself. Changed validation accordingly.
February 2012   Check for NullPointerException's source: ICU terminated with a NullPointerException merging freedb-update-20120101-20120201. Presumably ICU will be just the same with freedb-complete-20120201.tar. Using jChardet avoids NullPointerExceptions. This problem unfortunately is not fixed with integration of latest ICU-package in March 2012. If merging crashes (freezes) the importer without further notice ICU triggered a NullPointerExceptions and you should stick to charset-detector Chardet to get the merger going.
December 2010   take some screenshots from current GUI
December 2009   redesign and redo language-annotation (ease language annotation for languages not spotted yet)
September 2009   take disc's track-count into index (prepares quicker fuzzy retrieval form rudimentarily implemented cddb-substitute)
November 2008   implement prepared statements on mysql-server-side (should speed up spooling)



E-Mail: Christian Kruggel Version 2.1 by 21.10.2015