š Overview
What You'll Learn: How to connect Python to MySQL databases, extract data into pandas DataFrames, and perform data analysis operations - specifically tailored for ClydeEnergy's development workflow.
šÆ Learning Objectives
š ļø Technology Stack
- Database: MySQL Server 8.0+
- Python: Python 3.7+
- Libraries: mysql-connector-python, pandas, SQLAlchemy
- Tools: MySQL Workbench, Jupyter Notebook/IDE
- Data Analysis: pandas, numpy
š What We'll Build
By the end of this handbook, you'll be able to:
- Connect to any MySQL database from Python
- Execute complex SQL queries programmatically
- Load database tables into pandas for analysis
- Handle large datasets efficiently
- Manage database connections securely
š Prerequisites
Required Software
Required Knowledge
šļø Sample Database Setup
We'll use the MySQL sample "world" database, which includes these tables:
world (database)
āāā city (table)
ā āāā ID (int)
ā āāā Name (varchar)
ā āāā CountryCode (char)
ā āāā District (varchar)
ā āāā Population (int)
āāā country (table)
ā āāā Code (char)
ā āāā Name (varchar)
ā āāā Continent (varchar)
ā āāā Region (varchar)
ā āāā Population (int)
ā āāā ... (other columns)
āāā countrylanguage (table)
āāā CountryCode (char)
āāā Language (varchar)
āāā IsOfficial (enum)
āāā Percentage (decimal)
š” Tip: If you don't have the world database, you can download it from the MySQL documentation website or use any existing database you have access to.
āļø Installation & Setup
1
Install Required Python Libraries
Terminal/Command Prompt
# Method 1: Install individually !pip install mysql-connector-python !pip install pandas !pip install SQLAlchemy !pip matplotlib.pyplot !pip install seaborn !pip install numpy # Method 2: Install all at once pip install mysql-connector-python pandas SQLAlchemy # For Jupyter users pip install jupyter # Optional: For enhanced functionality pip install numpy matplotlib seaborn
2
Verify Installation
Python
# Test imports try: import mysql.connector print("ā mysql-connector-python installed successfully") except ImportError: print("ā mysql-connector-python not found") try: import pandas as pd print("ā pandas installed successfully") except ImportError: print("ā pandas not found") try: from sqlalchemy import create_engine print("ā SQLAlchemy installed successfully") except ImportError: print("ā SQLAlchemy not found")
3
Verify MySQL Connection
MySQL Workbench Test
-- Test connection in MySQL Workbench -- 1. Open MySQL Workbench -- 2. Connect to your local instance (usually localhost:3306) -- 3. Run this query to verify connection: SELECT @@version as mysql_version, @@port as mysql_port, DATABASE() as current_database, USER() as current_user; -- Check available databases SHOW DATABASES;
ā ļø Common Installation Issues:
- Permission errors: Use
pip install --user
or run terminal as administrator - Version conflicts: Use virtual environments:
python -m venv mysql_env
- MySQL not found: Ensure MySQL Server is installed and running
š Basic MySQL Connection
Method 1: Using mysql-connector-python
1
Import Required Libraries
Python Imports
# Method 1: Import the connector module import mysql.connector # Method 2: Import specific function (cleaner) from mysql.connector import connect # Import pandas for data manipulation import pandas as pd import mysql.connector from sqlalchemy import create_engine import matplotlib.pyplot as plt import seaborn as sns import numpy as np from datetime import datetime import warnings
2
Define Connection Parameters
Connection Configuration
# Use a configuration dictionary import mysql.connector config = { 'host': 'localhost', 'user': 'root', 'password': 'your_password', 'database': 'world', 'port': 3306 }
3
Establish Connection
Connection Establishment
# Using individual parameters try: connection = mysql.connector.connect( host=config['host'], database=config['database'], user=config['user'], password=config['password'], port=config['port'] ) print("MySQL connection successful!") if connection.is_connected(): cursor = connection.cursor() cursor.execute("SELECT DATABASE()") current_db = cursor.fetchone()[0] print(f"Connected to database: {current_db}") except mysql.connector.Error as e: print(f"Connection failed: {e}") connection = None
š Security Note: Never hardcode passwords in your scripts. Use environment variables or configuration files that are not committed to version control.
š Simple Data Querying
šÆ Reliable Method: Using the buffered cursor approach that works consistently with your database setup.
š Setup & Basic Connection Function
1
Complete Setup Code
Reliable Database Setup
# Reliable setup for data querying import mysql.connector import pandas as pd import matplotlib.pyplot as plt import seaborn as sns import numpy as np from datetime import datetime # Database configuration config = { 'host': 'localhost', 'user': 'root', 'password': 'your_password', 'database': 'world', 'port': 3306 } def create_connection(): """Create a reliable database connection""" try: connection = mysql.connector.connect(**config) print("ā Database connection successful!") return connection except mysql.connector.Error as e: print(f"ā Connection failed: {e}") return None def execute_query(query, connection=None): """Execute a query and return results using buffered cursor""" if connection is None: connection = create_connection() if connection is None: return None try: # Clear any pending results first while True: try: if connection.cursor().nextset(): continue break except Exception: break # Create buffered cursor cursor = connection.cursor(buffered=True) cursor.execute(query) # Get column names columns = [desc[0] for desc in cursor.description] # Fetch all results rows = cursor.fetchall() # Close cursor cursor.close() return rows, columns except mysql.connector.Error as e: print(f"ā Query failed: {e}") return None, None def query_to_dataframe(query, connection=None): """Execute query and return pandas DataFrame""" results, columns = execute_query(query, connection) if results is not None: df = pd.DataFrame(results, columns=columns) return df return None print("š Setup complete! Ready for data querying.") print(f"š Session started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
š Basic Data Exploration
2
Simple Table Exploration
Basic Queries
# Create connection connection = create_connection() if connection: print("=" * 50) print("š EXPLORING YOUR WORLD DATABASE") print("=" * 50) # 1. Check all tables print("\nšļø Available Tables:") query1 = "SHOW TABLES" results, columns = execute_query(query1, connection) if results: for table in results: print(f" ⢠{table[0]}") # 2. Count records in each table print("\nš Record Counts:") # City table count query2 = "SELECT COUNT(*) FROM city" results, _ = execute_query(query2, connection) if results: print(f" ⢠Cities: {results[0][0]:,}") # Country table count query3 = "SELECT COUNT(*) FROM country" results, _ = execute_query(query3, connection) if results: print(f" ⢠Countries: {results[0][0]:,}") # Language table count query4 = "SELECT COUNT(*) FROM countrylanguage" results, _ = execute_query(query4, connection) if results: print(f" ⢠Languages: {results[0][0]:,}") # 3. Sample data from each table print("\nš Sample Data:") # Sample cities print("\nš Top 5 Most Populous Cities:") query5 = "SELECT Name, CountryCode, Population FROM city ORDER BY Population DESC LIMIT 5" results, columns = execute_query(query5, connection) if results: for row in results: print(f" ⢠{row[0]} ({row[1]}): {row[2]:,} people") # Sample countries print("\nš Sample Countries:") query6 = "SELECT Name, Continent, Population FROM country ORDER BY Population DESC LIMIT 5" results, columns = execute_query(query6, connection) if results: for row in results: print(f" ⢠{row[0]} ({row[1]}): {row[2]:,} people") # Sample languages print("\nš£ļø Sample Languages:") query7 = "SELECT Language, CountryCode, Percentage FROM countrylanguage ORDER BY Percentage DESC LIMIT 5" results, columns = execute_query(query7, connection) if results: for row in results: print(f" ⢠{row[0]} in {row[1]}: {row[2]:.1f}%") connection.close() print("\nā Exploration complete!")
š Useful Query Examples
3
Practical Query Templates
Ready-to-Use Queries
# Practical queries for your World database analysis connection = create_connection() if connection: print("š PRACTICAL QUERY EXAMPLES") print("=" * 40) # 1. Find cities in a specific country print("\n1ļøā£ Cities in USA:") query_usa_cities = "SELECT Name, District, Population FROM city WHERE CountryCode = 'USA' ORDER BY Population DESC LIMIT 10" results, _ = execute_query(query_usa_cities, connection) if results: for city, district, pop in results: print(f" {city}, {district}: {pop:,}") # 2. Countries by continent print("\n2ļøā£ Countries by Continent:") query_continents = """ SELECT Continent, COUNT(*) as Country_Count, SUM(Population) as Total_Population FROM country GROUP BY Continent ORDER BY Total_Population DESC """ results, _ = execute_query(query_continents, connection) if results: for continent, count, total_pop in results: print(f" {continent}: {count} countries, {total_pop:,} people") # 3. Most spoken languages print("\n3ļøā£ Most Widespread Languages:") query_languages = """ SELECT Language, COUNT(*) as Countries_Count, AVG(Percentage) as Avg_Percentage FROM countrylanguage GROUP BY Language HAVING Countries_Count >= 5 ORDER BY Countries_Count DESC LIMIT 10 """ results, _ = execute_query(query_languages, connection) if results: for lang, countries, avg_pct in results: print(f" {lang}: {countries} countries (avg {avg_pct:.1f}%)") # 4. Countries with highest life expectancy print("\n4ļøā£ Highest Life Expectancy:") query_life_exp = """ SELECT Name, LifeExpectancy, Continent FROM country WHERE LifeExpectancy IS NOT NULL ORDER BY LifeExpectancy DESC LIMIT 10 """ results, _ = execute_query(query_life_exp, connection) if results: for country, life_exp, continent in results: print(f" {country} ({continent}): {life_exp:.1f} years") # 5. Large cities (population > 1 million) print("\n5ļøā£ Major Cities (>1M population):") query_major_cities = """ SELECT COUNT(*) as Major_Cities_Count FROM city WHERE Population > 1000000 """ results, _ = execute_query(query_major_cities, connection) if results: print(f" Total cities with >1M people: {results[0][0]}") # 6. Economic data (countries with GNP data) print("\n6ļøā£ Economic Overview:") query_economy = """ SELECT COUNT(*) as Countries_with_GNP, AVG(GNP/Population) as Avg_GNP_Per_Capita, MAX(GNP/Population) as Max_GNP_Per_Capita FROM country WHERE GNP > 0 AND Population > 0 """ results, _ = execute_query(query_economy, connection) if results: count, avg_gnp, max_gnp = results[0] print(f" Countries with economic data: {count}") print(f" Average GNP per capita: ${avg_gnp:.0f}") print(f" Highest GNP per capita: ${max_gnp:.0f}") connection.close() print("\nā Query examples complete!")
š Convert Queries to DataFrames
4
Working with Pandas DataFrames
DataFrame Integration
# Converting queries to pandas DataFrames for analysis connection = create_connection() if connection: print("š CREATING DATAFRAMES FROM QUERIES") print("=" * 45) # 1. Top 20 most populous cities as DataFrame print("\n1ļøā£ Creating Cities DataFrame...") cities_query = """ SELECT Name, CountryCode, District, Population FROM city ORDER BY Population DESC LIMIT 20 """ df_cities = query_to_dataframe(cities_query, connection) if df_cities is not None: print(f" Created DataFrame with {len(df_cities)} cities") print(" Sample data:") print(df_cities.head()) print(f"\n Total population in top 20 cities: {df_cities['Population'].sum():,}") # 2. Countries by continent as DataFrame print("\n2ļøā£ Creating Countries DataFrame...") countries_query = """ SELECT Name, Continent, Population, LifeExpectancy, GNP FROM country WHERE Population > 0 ORDER BY Population DESC """ df_countries = query_to_dataframe(countries_query, connection) if df_countries is not None: print(f" Created DataFrame with {len(df_countries)} countries") print(" Continent distribution:") print(df_countries['Continent'].value_counts()) # Basic statistics print("\n Population Statistics:") print(f" - Average: {df_countries['Population'].mean():,.0f}") print(f" - Median: {df_countries['Population'].median():,.0f}") print(f" - Max: {df_countries['Population'].max():,.0f}") # 3. Languages DataFrame print("\n3ļøā£ Creating Languages DataFrame...") languages_query = """ SELECT Language, CountryCode, IsOfficial, Percentage FROM countrylanguage ORDER BY Percentage DESC """ df_languages = query_to_dataframe(languages_query, connection) if df_languages is not None: print(f" Created DataFrame with {len(df_languages)} language records") # Most common languages top_languages = df_languages['Language'].value_counts().head(10) print(" Most widespread languages:") for lang, count in top_languages.items(): print(f" - {lang}: {count} countries") # 4. Combined analysis DataFrame print("\n4ļøā£ Creating Combined Analysis DataFrame...") combined_query = """ SELECT co.Name as Country, co.Continent, co.Population as Country_Population, co.LifeExpectancy, COUNT(ci.ID) as Number_of_Cities, MAX(ci.Population) as Largest_City_Population FROM country co LEFT JOIN city ci ON co.Code = ci.CountryCode WHERE co.Population > 1000000 GROUP BY co.Code, co.Name, co.Continent, co.Population, co.LifeExpectancy ORDER BY co.Population DESC LIMIT 30 """ df_combined = query_to_dataframe(combined_query, connection) if df_combined is not None: print(f" Created combined DataFrame with {len(df_combined)} countries") print(" Sample combined data:") print(df_combined[['Country', 'Continent', 'Country_Population', 'Number_of_Cities']].head()) # Analysis print(f"\n Average cities per country: {df_combined['Number_of_Cities'].mean():.1f}") print(f" Country with most cities: {df_combined.loc[df_combined['Number_of_Cities'].idxmax(), 'Country']}") connection.close() print("\nā DataFrame creation complete!") # You can now use these DataFrames for further analysis print("\nš” DataFrames available for analysis:") print(" - df_cities: Top 20 most populous cities") print(" - df_countries: All countries with population data") print(" - df_languages: Language distribution data") print(" - df_combined: Countries with city statistics")
š Simple Visualizations
5
Basic Charts from Your Data
Simple Visualizations
# Simple visualizations using your database data import matplotlib.pyplot as plt import seaborn as sns # Set up plotting style plt.style.use('default') sns.set_palette("husl") connection = create_connection() if connection: print("š CREATING VISUALIZATIONS") print("=" * 30) # 1. Population by Continent - Bar Chart print("\n1ļøā£ Creating Population by Continent chart...") continent_query = """ SELECT Continent, SUM(Population) as Total_Population FROM country GROUP BY Continent ORDER BY Total_Population DESC """ df_continent = query_to_dataframe(continent_query, connection) if df_continent is not None: plt.figure(figsize=(12, 6)) plt.bar(df_continent['Continent'], df_continent['Total_Population'] / 1e9) plt.title('World Population by Continent', fontsize=16, fontweight='bold') plt.xlabel('Continent') plt.ylabel('Population (Billions)') plt.xticks(rotation=45) # Add value labels on bars for i, v in enumerate(df_continent['Total_Population'] / 1e9): plt.text(i, v + 0.1, f'{v:.1f}B', ha='center', va='bottom', fontweight='bold') plt.tight_layout() plt.show() # 2. Top 15 Countries by Population - Horizontal Bar Chart print("\n2ļøā£ Creating Top Countries chart...") top_countries_query = """ SELECT Name, Population FROM country ORDER BY Population DESC LIMIT 15 """ df_top_countries = query_to_dataframe(top_countries_query, connection) if df_top_countries is not None: plt.figure(figsize=(12, 8)) plt.barh(range(len(df_top_countries)), df_top_countries['Population'] / 1e6) plt.yticks(range(len(df_top_countries)), df_top_countries['Name']) plt.xlabel('Population (Millions)') plt.title('Top 15 Most Populous Countries', fontsize=16, fontweight='bold') plt.gca().invert_yaxis() # Highest at top # Add value labels for i, v in enumerate(df_top_countries['Population'] / 1e6): plt.text(v + 10, i, f'{v:.0f}M', va='center', fontweight='bold') plt.tight_layout() plt.show() # 3. Life Expectancy Distribution - Histogram print("\n3ļøā£ Creating Life Expectancy distribution...") life_exp_query = """ SELECT LifeExpectancy FROM country WHERE LifeExpectancy IS NOT NULL """ df_life_exp = query_to_dataframe(life_exp_query, connection) if df_life_exp is not None: plt.figure(figsize=(10, 6)) plt.hist(df_life_exp['LifeExpectancy'], bins=20, alpha=0.7, color='skyblue', edgecolor='black') plt.xlabel('Life Expectancy (Years)') plt.ylabel('Number of Countries') plt.title('Distribution of Life Expectancy Worldwide', fontsize=16, fontweight='bold') # Add average line avg_life_exp = df_life_exp['LifeExpectancy'].mean() plt.axvline(avg_life_exp, color='red', linestyle='--', linewidth=2, label=f'Average: {avg_life_exp:.1f} years') plt.legend() plt.tight_layout() plt.show() # 4. Language Distribution - Pie Chart print("\n4ļøā£ Creating Language distribution chart...") lang_query = """ SELECT Language, COUNT(*) as Country_Count FROM countrylanguage WHERE IsOfficial = 'T' GROUP BY Language HAVING Country_Count >= 3 ORDER BY Country_Count DESC LIMIT 10 """ df_languages = query_to_dataframe(lang_query, connection) if df_languages is not None: plt.figure(figsize=(10, 8)) colors = plt.cm.Set3(range(len(df_languages))) plt.pie(df_languages['Country_Count'], labels=df_languages['Language'], autopct='%1.1f%%', colors=colors, startangle=90) plt.title('Most Common Official Languages\n(Countries with 3+ official languages)', fontsize=16, fontweight='bold') plt.axis('equal') plt.tight_layout() plt.show() # 5. Cities vs Countries by Continent - Scatter Plot print("\n5ļøā£ Creating Cities vs Countries scatter plot...") cities_countries_query = """ SELECT co.Continent, COUNT(DISTINCT co.Code) as Countries, COUNT(ci.ID) as Cities FROM country co LEFT JOIN city ci ON co.Code = ci.CountryCode GROUP BY co.Continent """ df_cities_countries = query_to_dataframe(cities_countries_query, connection) if df_cities_countries is not None: plt.figure(figsize=(10, 6)) plt.scatter(df_cities_countries['Countries'], df_cities_countries['Cities'], s=200, alpha=0.7, c=range(len(df_cities_countries)), cmap='viridis') # Add continent labels for i, continent in enumerate(df_cities_countries['Continent']): plt.annotate(continent, (df_cities_countries['Countries'].iloc[i], df_cities_countries['Cities'].iloc[i]), xytext=(5, 5), textcoords='offset points', fontsize=10, fontweight='bold') plt.xlabel('Number of Countries') plt.ylabel('Number of Cities in Database') plt.title('Countries vs Cities by Continent', fontsize=16, fontweight='bold') plt.grid(True, alpha=0.3) plt.tight_layout() plt.show() connection.close() print("\nā All visualizations complete!") print("\nš” Tip: You can modify these queries to explore different aspects of your data!")
š” Usage Tips:
- Buffered Cursor: Always use `cursor(buffered=True)` for reliable results
- Clear Results: Use the nextset() loop to clear any pending results
- Error Handling: Wrap queries in try-except blocks
- Connection Management: Always close connections when done
- DataFrame Integration: Use the helper function for easy pandas integration
ā ļø Important Notes:
- Replace credentials: Update host, user, password with your actual values
- Database name: Change 'world' to your actual database name
- Table names: Adjust table names (city, country, countrylanguage) if different
- Test queries: Always test with LIMIT when exploring large datasets
š Data Extraction
šÆ Goal: Extract data from MySQL into pandas DataFrames for analysis and visualization.