šŸ“‹ Overview

What You'll Learn: How to connect Python to MySQL databases, extract data into pandas DataFrames, and perform data analysis operations - specifically tailored for ClydeEnergy's development workflow.

šŸŽÆ Learning Objectives

šŸ› ļø Technology Stack

  • Database: MySQL Server 8.0+
  • Python: Python 3.7+
  • Libraries: mysql-connector-python, pandas, SQLAlchemy
  • Tools: MySQL Workbench, Jupyter Notebook/IDE
  • Data Analysis: pandas, numpy

šŸ“Š What We'll Build

By the end of this handbook, you'll be able to:

  • Connect to any MySQL database from Python
  • Execute complex SQL queries programmatically
  • Load database tables into pandas for analysis
  • Handle large datasets efficiently
  • Manage database connections securely

šŸ“‹ Prerequisites

Required Software

Required Knowledge

šŸ—„ļø Sample Database Setup

We'll use the MySQL sample "world" database, which includes these tables:

world (database) ā”œā”€ā”€ city (table) │ ā”œā”€ā”€ ID (int) │ ā”œā”€ā”€ Name (varchar) │ ā”œā”€ā”€ CountryCode (char) │ ā”œā”€ā”€ District (varchar) │ └── Population (int) ā”œā”€ā”€ country (table) │ ā”œā”€ā”€ Code (char) │ ā”œā”€ā”€ Name (varchar) │ ā”œā”€ā”€ Continent (varchar) │ ā”œā”€ā”€ Region (varchar) │ ā”œā”€ā”€ Population (int) │ └── ... (other columns) └── countrylanguage (table) ā”œā”€ā”€ CountryCode (char) ā”œā”€ā”€ Language (varchar) ā”œā”€ā”€ IsOfficial (enum) └── Percentage (decimal)
šŸ’” Tip: If you don't have the world database, you can download it from the MySQL documentation website or use any existing database you have access to.

āš™ļø Installation & Setup

1

Install Required Python Libraries

Terminal/Command Prompt
# Method 1: Install individually
!pip install mysql-connector-python
!pip install pandas
!pip install SQLAlchemy
!pip matplotlib.pyplot
!pip install seaborn
!pip install numpy 

# Method 2: Install all at once
pip install mysql-connector-python pandas SQLAlchemy

# For Jupyter users
pip install jupyter

# Optional: For enhanced functionality
pip install numpy matplotlib seaborn
2

Verify Installation

Python
# Test imports
try:
    import mysql.connector
    print("āœ… mysql-connector-python installed successfully")
except ImportError:
    print("āŒ mysql-connector-python not found")

try:
    import pandas as pd
    print("āœ… pandas installed successfully")
except ImportError:
    print("āŒ pandas not found")

try:
    from sqlalchemy import create_engine
    print("āœ… SQLAlchemy installed successfully")
except ImportError:
    print("āŒ SQLAlchemy not found")
3

Verify MySQL Connection

MySQL Workbench Test
-- Test connection in MySQL Workbench
-- 1. Open MySQL Workbench
-- 2. Connect to your local instance (usually localhost:3306)
-- 3. Run this query to verify connection:

SELECT 
    @@version as mysql_version,
    @@port as mysql_port,
    DATABASE() as current_database,
    USER() as current_user;

-- Check available databases
SHOW DATABASES;
āš ļø Common Installation Issues:
  • Permission errors: Use pip install --user or run terminal as administrator
  • Version conflicts: Use virtual environments: python -m venv mysql_env
  • MySQL not found: Ensure MySQL Server is installed and running

šŸ”Œ Basic MySQL Connection

Method 1: Using mysql-connector-python

1

Import Required Libraries

Python Imports
# Method 1: Import the connector module
import mysql.connector

# Method 2: Import specific function (cleaner)
from mysql.connector import connect

# Import pandas for data manipulation
import pandas as pd
import mysql.connector
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from datetime import datetime
import warnings
2

Define Connection Parameters

Connection Configuration
# Use a configuration dictionary
import mysql.connector

config = {
    'host': 'localhost',
    'user': 'root',
    'password': 'your_password',
    'database': 'world',
    'port': 3306
}
3

Establish Connection

Connection Establishment
# Using individual parameters
try:
    connection = mysql.connector.connect(
        host=config['host'],
        database=config['database'],
        user=config['user'],
        password=config['password'],
        port=config['port']
    )

    print("MySQL connection successful!")

    if connection.is_connected():
        cursor = connection.cursor()
        cursor.execute("SELECT DATABASE()")
        current_db = cursor.fetchone()[0]
        print(f"Connected to database: {current_db}")

except mysql.connector.Error as e:
    print(f"Connection failed: {e}")
    connection = None
šŸ”’ Security Note: Never hardcode passwords in your scripts. Use environment variables or configuration files that are not committed to version control.

šŸ“Š Simple Data Querying

šŸŽÆ Reliable Method: Using the buffered cursor approach that works consistently with your database setup.

šŸ”„ Setup & Basic Connection Function

1

Complete Setup Code

Reliable Database Setup
# Reliable setup for data querying
import mysql.connector
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from datetime import datetime

# Database configuration
config = {
    'host': 'localhost',
    'user': 'root',
    'password': 'your_password',
    'database': 'world',
    'port': 3306
}

def create_connection():
    """Create a reliable database connection"""
    try:
        connection = mysql.connector.connect(**config)
        print("āœ… Database connection successful!")
        return connection
    except mysql.connector.Error as e:
        print(f"āŒ Connection failed: {e}")
        return None

def execute_query(query, connection=None):
    """Execute a query and return results using buffered cursor"""
    if connection is None:
        connection = create_connection()
        if connection is None:
            return None
    
    try:
        # Clear any pending results first
        while True:
            try:
                if connection.cursor().nextset():
                    continue
                break
            except Exception:
                break
        
        # Create buffered cursor
        cursor = connection.cursor(buffered=True)
        cursor.execute(query)
        
        # Get column names
        columns = [desc[0] for desc in cursor.description]
        
        # Fetch all results
        rows = cursor.fetchall()
        
        # Close cursor
        cursor.close()
        
        return rows, columns
        
    except mysql.connector.Error as e:
        print(f"āŒ Query failed: {e}")
        return None, None

def query_to_dataframe(query, connection=None):
    """Execute query and return pandas DataFrame"""
    results, columns = execute_query(query, connection)
    if results is not None:
        df = pd.DataFrame(results, columns=columns)
        return df
    return None

print("šŸš€ Setup complete! Ready for data querying.")
print(f"šŸ“… Session started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

šŸ“‹ Basic Data Exploration

2

Simple Table Exploration

Basic Queries
# Create connection
connection = create_connection()

if connection:
    print("=" * 50)
    print("šŸ“Š EXPLORING YOUR WORLD DATABASE")
    print("=" * 50)
    
    # 1. Check all tables
    print("\nšŸ—‚ļø Available Tables:")
    query1 = "SHOW TABLES"
    results, columns = execute_query(query1, connection)
    if results:
        for table in results:
            print(f"  • {table[0]}")
    
    # 2. Count records in each table
    print("\nšŸ“ˆ Record Counts:")
    
    # City table count
    query2 = "SELECT COUNT(*) FROM city"
    results, _ = execute_query(query2, connection)
    if results:
        print(f"  • Cities: {results[0][0]:,}")
    
    # Country table count
    query3 = "SELECT COUNT(*) FROM country"
    results, _ = execute_query(query3, connection)
    if results:
        print(f"  • Countries: {results[0][0]:,}")
    
    # Language table count
    query4 = "SELECT COUNT(*) FROM countrylanguage"
    results, _ = execute_query(query4, connection)
    if results:
        print(f"  • Languages: {results[0][0]:,}")
    
    # 3. Sample data from each table
    print("\nšŸ” Sample Data:")
    
    # Sample cities
    print("\nšŸ“ Top 5 Most Populous Cities:")
    query5 = "SELECT Name, CountryCode, Population FROM city ORDER BY Population DESC LIMIT 5"
    results, columns = execute_query(query5, connection)
    if results:
        for row in results:
            print(f"  • {row[0]} ({row[1]}): {row[2]:,} people")
    
    # Sample countries
    print("\nšŸŒ Sample Countries:")
    query6 = "SELECT Name, Continent, Population FROM country ORDER BY Population DESC LIMIT 5"
    results, columns = execute_query(query6, connection)
    if results:
        for row in results:
            print(f"  • {row[0]} ({row[1]}): {row[2]:,} people")
    
    # Sample languages
    print("\nšŸ—£ļø Sample Languages:")
    query7 = "SELECT Language, CountryCode, Percentage FROM countrylanguage ORDER BY Percentage DESC LIMIT 5"
    results, columns = execute_query(query7, connection)
    if results:
        for row in results:
            print(f"  • {row[0]} in {row[1]}: {row[2]:.1f}%")
    
    connection.close()
    print("\nāœ… Exploration complete!")

šŸ” Useful Query Examples

3

Practical Query Templates

Ready-to-Use Queries
# Practical queries for your World database analysis

connection = create_connection()

if connection:
    print("šŸ”Ž PRACTICAL QUERY EXAMPLES")
    print("=" * 40)
    
    # 1. Find cities in a specific country
    print("\n1ļøāƒ£ Cities in USA:")
    query_usa_cities = "SELECT Name, District, Population FROM city WHERE CountryCode = 'USA' ORDER BY Population DESC LIMIT 10"
    results, _ = execute_query(query_usa_cities, connection)
    if results:
        for city, district, pop in results:
            print(f"   {city}, {district}: {pop:,}")
    
    # 2. Countries by continent
    print("\n2ļøāƒ£ Countries by Continent:")
    query_continents = """
    SELECT Continent, COUNT(*) as Country_Count, SUM(Population) as Total_Population
    FROM country 
    GROUP BY Continent 
    ORDER BY Total_Population DESC
    """
    results, _ = execute_query(query_continents, connection)
    if results:
        for continent, count, total_pop in results:
            print(f"   {continent}: {count} countries, {total_pop:,} people")
    
    # 3. Most spoken languages
    print("\n3ļøāƒ£ Most Widespread Languages:")
    query_languages = """
    SELECT Language, COUNT(*) as Countries_Count, AVG(Percentage) as Avg_Percentage
    FROM countrylanguage 
    GROUP BY Language 
    HAVING Countries_Count >= 5
    ORDER BY Countries_Count DESC 
    LIMIT 10
    """
    results, _ = execute_query(query_languages, connection)
    if results:
        for lang, countries, avg_pct in results:
            print(f"   {lang}: {countries} countries (avg {avg_pct:.1f}%)")
    
    # 4. Countries with highest life expectancy
    print("\n4ļøāƒ£ Highest Life Expectancy:")
    query_life_exp = """
    SELECT Name, LifeExpectancy, Continent
    FROM country 
    WHERE LifeExpectancy IS NOT NULL
    ORDER BY LifeExpectancy DESC 
    LIMIT 10
    """
    results, _ = execute_query(query_life_exp, connection)
    if results:
        for country, life_exp, continent in results:
            print(f"   {country} ({continent}): {life_exp:.1f} years")
    
    # 5. Large cities (population > 1 million)
    print("\n5ļøāƒ£ Major Cities (>1M population):")
    query_major_cities = """
    SELECT COUNT(*) as Major_Cities_Count
    FROM city 
    WHERE Population > 1000000
    """
    results, _ = execute_query(query_major_cities, connection)
    if results:
        print(f"   Total cities with >1M people: {results[0][0]}")
    
    # 6. Economic data (countries with GNP data)
    print("\n6ļøāƒ£ Economic Overview:")
    query_economy = """
    SELECT COUNT(*) as Countries_with_GNP, 
           AVG(GNP/Population) as Avg_GNP_Per_Capita,
           MAX(GNP/Population) as Max_GNP_Per_Capita
    FROM country 
    WHERE GNP > 0 AND Population > 0
    """
    results, _ = execute_query(query_economy, connection)
    if results:
        count, avg_gnp, max_gnp = results[0]
        print(f"   Countries with economic data: {count}")
        print(f"   Average GNP per capita: ${avg_gnp:.0f}")
        print(f"   Highest GNP per capita: ${max_gnp:.0f}")
    
    connection.close()
    print("\nāœ… Query examples complete!")

šŸ“Š Convert Queries to DataFrames

4

Working with Pandas DataFrames

DataFrame Integration
# Converting queries to pandas DataFrames for analysis

connection = create_connection()

if connection:
    print("šŸ“Š CREATING DATAFRAMES FROM QUERIES")
    print("=" * 45)
    
    # 1. Top 20 most populous cities as DataFrame
    print("\n1ļøāƒ£ Creating Cities DataFrame...")
    cities_query = """
    SELECT Name, CountryCode, District, Population
    FROM city 
    ORDER BY Population DESC 
    LIMIT 20
    """
    df_cities = query_to_dataframe(cities_query, connection)
    if df_cities is not None:
        print(f"   Created DataFrame with {len(df_cities)} cities")
        print("   Sample data:")
        print(df_cities.head())
        print(f"\n   Total population in top 20 cities: {df_cities['Population'].sum():,}")
    
    # 2. Countries by continent as DataFrame
    print("\n2ļøāƒ£ Creating Countries DataFrame...")
    countries_query = """
    SELECT Name, Continent, Population, LifeExpectancy, GNP
    FROM country 
    WHERE Population > 0
    ORDER BY Population DESC
    """
    df_countries = query_to_dataframe(countries_query, connection)
    if df_countries is not None:
        print(f"   Created DataFrame with {len(df_countries)} countries")
        print("   Continent distribution:")
        print(df_countries['Continent'].value_counts())
        
        # Basic statistics
        print("\n   Population Statistics:")
        print(f"   - Average: {df_countries['Population'].mean():,.0f}")
        print(f"   - Median: {df_countries['Population'].median():,.0f}")
        print(f"   - Max: {df_countries['Population'].max():,.0f}")
    
    # 3. Languages DataFrame
    print("\n3ļøāƒ£ Creating Languages DataFrame...")
    languages_query = """
    SELECT Language, CountryCode, IsOfficial, Percentage
    FROM countrylanguage 
    ORDER BY Percentage DESC
    """
    df_languages = query_to_dataframe(languages_query, connection)
    if df_languages is not None:
        print(f"   Created DataFrame with {len(df_languages)} language records")
        
        # Most common languages
        top_languages = df_languages['Language'].value_counts().head(10)
        print("   Most widespread languages:")
        for lang, count in top_languages.items():
            print(f"   - {lang}: {count} countries")
    
    # 4. Combined analysis DataFrame
    print("\n4ļøāƒ£ Creating Combined Analysis DataFrame...")
    combined_query = """
    SELECT 
        co.Name as Country,
        co.Continent,
        co.Population as Country_Population,
        co.LifeExpectancy,
        COUNT(ci.ID) as Number_of_Cities,
        MAX(ci.Population) as Largest_City_Population
    FROM country co
    LEFT JOIN city ci ON co.Code = ci.CountryCode
    WHERE co.Population > 1000000
    GROUP BY co.Code, co.Name, co.Continent, co.Population, co.LifeExpectancy
    ORDER BY co.Population DESC
    LIMIT 30
    """
    df_combined = query_to_dataframe(combined_query, connection)
    if df_combined is not None:
        print(f"   Created combined DataFrame with {len(df_combined)} countries")
        print("   Sample combined data:")
        print(df_combined[['Country', 'Continent', 'Country_Population', 'Number_of_Cities']].head())
        
        # Analysis
        print(f"\n   Average cities per country: {df_combined['Number_of_Cities'].mean():.1f}")
        print(f"   Country with most cities: {df_combined.loc[df_combined['Number_of_Cities'].idxmax(), 'Country']}")
    
    connection.close()
    print("\nāœ… DataFrame creation complete!")
    
    # You can now use these DataFrames for further analysis
    print("\nšŸ’” DataFrames available for analysis:")
    print("   - df_cities: Top 20 most populous cities")
    print("   - df_countries: All countries with population data")
    print("   - df_languages: Language distribution data")
    print("   - df_combined: Countries with city statistics")

šŸ“ˆ Simple Visualizations

5

Basic Charts from Your Data

Simple Visualizations
# Simple visualizations using your database data

import matplotlib.pyplot as plt
import seaborn as sns

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

connection = create_connection()

if connection:
    print("šŸ“Š CREATING VISUALIZATIONS")
    print("=" * 30)
    
    # 1. Population by Continent - Bar Chart
    print("\n1ļøāƒ£ Creating Population by Continent chart...")
    continent_query = """
    SELECT Continent, SUM(Population) as Total_Population
    FROM country 
    GROUP BY Continent 
    ORDER BY Total_Population DESC
    """
    df_continent = query_to_dataframe(continent_query, connection)
    
    if df_continent is not None:
        plt.figure(figsize=(12, 6))
        plt.bar(df_continent['Continent'], df_continent['Total_Population'] / 1e9)
        plt.title('World Population by Continent', fontsize=16, fontweight='bold')
        plt.xlabel('Continent')
        plt.ylabel('Population (Billions)')
        plt.xticks(rotation=45)
        
        # Add value labels on bars
        for i, v in enumerate(df_continent['Total_Population'] / 1e9):
            plt.text(i, v + 0.1, f'{v:.1f}B', ha='center', va='bottom', fontweight='bold')
        
        plt.tight_layout()
        plt.show()
    
    # 2. Top 15 Countries by Population - Horizontal Bar Chart
    print("\n2ļøāƒ£ Creating Top Countries chart...")
    top_countries_query = """
    SELECT Name, Population 
    FROM country 
    ORDER BY Population DESC 
    LIMIT 15
    """
    df_top_countries = query_to_dataframe(top_countries_query, connection)
    
    if df_top_countries is not None:
        plt.figure(figsize=(12, 8))
        plt.barh(range(len(df_top_countries)), df_top_countries['Population'] / 1e6)
        plt.yticks(range(len(df_top_countries)), df_top_countries['Name'])
        plt.xlabel('Population (Millions)')
        plt.title('Top 15 Most Populous Countries', fontsize=16, fontweight='bold')
        plt.gca().invert_yaxis()  # Highest at top
        
        # Add value labels
        for i, v in enumerate(df_top_countries['Population'] / 1e6):
            plt.text(v + 10, i, f'{v:.0f}M', va='center', fontweight='bold')
        
        plt.tight_layout()
        plt.show()
    
    # 3. Life Expectancy Distribution - Histogram
    print("\n3ļøāƒ£ Creating Life Expectancy distribution...")
    life_exp_query = """
    SELECT LifeExpectancy 
    FROM country 
    WHERE LifeExpectancy IS NOT NULL
    """
    df_life_exp = query_to_dataframe(life_exp_query, connection)
    
    if df_life_exp is not None:
        plt.figure(figsize=(10, 6))
        plt.hist(df_life_exp['LifeExpectancy'], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
        plt.xlabel('Life Expectancy (Years)')
        plt.ylabel('Number of Countries')
        plt.title('Distribution of Life Expectancy Worldwide', fontsize=16, fontweight='bold')
        
        # Add average line
        avg_life_exp = df_life_exp['LifeExpectancy'].mean()
        plt.axvline(avg_life_exp, color='red', linestyle='--', linewidth=2, 
                   label=f'Average: {avg_life_exp:.1f} years')
        plt.legend()
        
        plt.tight_layout()
        plt.show()
    
    # 4. Language Distribution - Pie Chart
    print("\n4ļøāƒ£ Creating Language distribution chart...")
    lang_query = """
    SELECT Language, COUNT(*) as Country_Count
    FROM countrylanguage 
    WHERE IsOfficial = 'T'
    GROUP BY Language 
    HAVING Country_Count >= 3
    ORDER BY Country_Count DESC
    LIMIT 10
    """
    df_languages = query_to_dataframe(lang_query, connection)
    
    if df_languages is not None:
        plt.figure(figsize=(10, 8))
        colors = plt.cm.Set3(range(len(df_languages)))
        plt.pie(df_languages['Country_Count'], labels=df_languages['Language'], 
               autopct='%1.1f%%', colors=colors, startangle=90)
        plt.title('Most Common Official Languages\n(Countries with 3+ official languages)', 
                 fontsize=16, fontweight='bold')
        plt.axis('equal')
        plt.tight_layout()
        plt.show()
    
    # 5. Cities vs Countries by Continent - Scatter Plot
    print("\n5ļøāƒ£ Creating Cities vs Countries scatter plot...")
    cities_countries_query = """
    SELECT 
        co.Continent,
        COUNT(DISTINCT co.Code) as Countries,
        COUNT(ci.ID) as Cities
    FROM country co
    LEFT JOIN city ci ON co.Code = ci.CountryCode
    GROUP BY co.Continent
    """
    df_cities_countries = query_to_dataframe(cities_countries_query, connection)
    
    if df_cities_countries is not None:
        plt.figure(figsize=(10, 6))
        plt.scatter(df_cities_countries['Countries'], df_cities_countries['Cities'], 
                   s=200, alpha=0.7, c=range(len(df_cities_countries)), cmap='viridis')
        
        # Add continent labels
        for i, continent in enumerate(df_cities_countries['Continent']):
            plt.annotate(continent, 
                        (df_cities_countries['Countries'].iloc[i], df_cities_countries['Cities'].iloc[i]),
                        xytext=(5, 5), textcoords='offset points', fontsize=10, fontweight='bold')
        
        plt.xlabel('Number of Countries')
        plt.ylabel('Number of Cities in Database')
        plt.title('Countries vs Cities by Continent', fontsize=16, fontweight='bold')
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()
    
    connection.close()
    print("\nāœ… All visualizations complete!")
    print("\nšŸ’” Tip: You can modify these queries to explore different aspects of your data!")
šŸ’” Usage Tips:
  • Buffered Cursor: Always use `cursor(buffered=True)` for reliable results
  • Clear Results: Use the nextset() loop to clear any pending results
  • Error Handling: Wrap queries in try-except blocks
  • Connection Management: Always close connections when done
  • DataFrame Integration: Use the helper function for easy pandas integration
āš ļø Important Notes:
  • Replace credentials: Update host, user, password with your actual values
  • Database name: Change 'world' to your actual database name
  • Table names: Adjust table names (city, country, countrylanguage) if different
  • Test queries: Always test with LIMIT when exploring large datasets

šŸ“Š Data Extraction

šŸŽÆ Goal: Extract data from MySQL into pandas DataFrames for analysis and visualization.