Tabular Module¶
Overview¶
The tabular module provides a collection of utility functions for exploratory data analysis and data profiling of tabular data. It includes functions for summarizing DataFrames, analyzing missing values, profiling numeric columns, and examining correlations.
Utilities¶
1. summary(df)¶
Provides a comprehensive one-shot overview of the DataFrame combining shape, data types, basic statistics, missing value counts, and cardinality information.
Parameters:
- df (pd.DataFrame): The input DataFrame
Returns: - pd.DataFrame: Summary statistics including dtype, missing counts, percentages, and unique values
Example:
import pandas as pd
from fenn.tabular import summary
# Create sample DataFrame
df = pd.DataFrame({
'age': [25, 30, None, 35, 40],
'salary': [50000, 60000, 55000, None, 70000],
'department': ['HR', 'IT', 'IT', 'HR', 'Finance'],
'score': [8.5, 9.2, 7.8, 8.1, 9.0]
})
# Get summary
summary_stats = summary(df)
print(summary_stats)
Output:
Shape: 5 rows x 4 columns
dtype missing missing_% unique
age int64 1 20.0 4
salary int64 1 20.0 4
department object 0 0.0 3
score float64 0 0.0 5
2. missing_report(df)¶
Generates a compact report of missing values per column with percentages and flags for all-null or almost-all-null columns.
Parameters:
- df (pd.DataFrame): The input DataFrame
Returns: - pd.DataFrame: Report showing missing counts, percentages, and flags for problematic columns
Example:
from fenn.tabular import missing_report
missing = missing_report(df)
print(missing)
Output:
missing missing_% all_null almost_null
age 1 20.0 False False
salary 1 20.0 False False
3. numeric_profile(df, clip_quantile=None)¶
Describes numeric columns only with statistics (min, max, mean, std, quantiles) and optional clipping of extreme quantiles.
Parameters:
- df (pd.DataFrame): The input DataFrame
- clip_quantile (float, optional): Quantile threshold for clipping extreme values (e.g., 0.05 clips at 5th and 95th percentiles)
Returns: - pd.DataFrame: Descriptive statistics for numeric columns
Example:
from fenn.tabular import numeric_profile
# Profile without clipping
profile = numeric_profile(df)
print(profile)
# Profile with clipping (remove top and bottom 5%)
profile_clipped = numeric_profile(df, clip_quantile=0.05)
print(profile_clipped)
Output:
count mean std min 25% 50% 75% max
age 4.0 32.5000 6.454972 25.0 28.75 32.5 36.25 40.0
salary 4.0 58750.0 8541.666667 50000.0 53750.0 57500.0 63750.0 70000.0
score 5.0 8.5200 0.579898 7.8 8.1 8.5 9.0 9.2
4. quick_sample(df, n=5, columns=None, seed=None)¶
Convenience wrapper for sampling rows from a DataFrame with optional column subset and random seed control.
Parameters:
- df (pd.DataFrame): The input DataFrame
- n (int): Number of rows to sample (default: 5)
- columns (list, optional): Specific columns to include in the sample
- seed (int, optional): Random seed for reproducibility
Returns: - pd.DataFrame: Random sample of rows
Example:
from fenn.tabular import quick_sample
# Random sample of 3 rows
sample = quick_sample(df, n=3, seed=42)
print(sample)
# Sample specific columns only
sample_cols = quick_sample(df, n=3, columns=['age', 'department'], seed=42)
print(sample_cols)
5. unique_report(df)¶
Shows the number of unique values per column and percentage of unique values relative to total rows.
Parameters:
- df (pd.DataFrame): The input DataFrame
Returns: - pd.DataFrame: Unique count and percentage for each column
Example:
from fenn.tabular import unique_report
uniqueness = unique_report(df)
print(uniqueness)
Output:
unique_count unique_%
age 4 80.0
salary 4 80.0
department 3 60.0
score 5 100.0
6. corr_overview(df, top_n=10)¶
Computes correlations between numeric columns and returns the strongest pairs as a tidy table, useful for quick correlation analysis.
Parameters:
- df (pd.DataFrame): The input DataFrame
- top_n (int): Number of top correlations to return (default: 10)
Returns: - pd.DataFrame: Pairs of features with their correlations sorted by absolute correlation
Example:
from fenn.tabular import corr_overview
# Get top 5 correlations
correlations = corr_overview(df, top_n=5)
print(correlations)
Output:
feature_1 feature_2 correlation abs_correlation
0 score age 0.95 0.95
1 salary score 0.87 0.87
7. array_summary(arr)¶
NumPy-oriented helper function that provides shape, dtype, basic statistics, and NaN checks on ndarrays.
Parameters:
- arr (np.ndarray): The input NumPy array
Returns: - pd.DataFrame: Summary statistics including shape, dtype, size, mean, std, min, max, and NaN counts
Example:
import numpy as np
from fenn.tabular import array_summary
# Create sample array
arr = np.array([1.5, 2.3, np.nan, 4.1, 5.2, np.nan])
# Get summary
arr_summary = array_summary(arr)
print(arr_summary)
Output:
shape dtype size mean std min max nan_count nan_%
0 (6,) float64 6 3.275000 1.674784 1.5 5.2 2 33.33
Complete Usage Example¶
import pandas as pd
import numpy as np
from fenn.tabular import (
summary,
missing_report,
numeric_profile,
quick_sample,
unique_report,
corr_overview,
array_summary
)
# Create a realistic dataset
df = pd.DataFrame({
'customer_id': range(1, 101),
'age': np.random.randint(20, 70, 100),
'income': np.random.randint(30000, 150000, 100),
'spending': np.random.randint(1000, 50000, 100),
'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
'email': ['user' + str(i) + '@example.com' for i in range(100)]
})
# Add some missing values
df.loc[df.sample(10, random_state=42).index, 'age'] = None
df.loc[df.sample(5, random_state=42).index, 'income'] = None
# 1. Get overall summary
print("=" * 50)
print("DATASET SUMMARY")
print("=" * 50)
summary(df)
# 2. Check missing values
print("\n" + "=" * 50)
print("MISSING VALUES REPORT")
print("=" * 50)
print(missing_report(df))
# 3. Profile numeric columns
print("\n" + "=" * 50)
print("NUMERIC PROFILE")
print("=" * 50)
print(numeric_profile(df, clip_quantile=0.05))
# 4. Get random sample
print("\n" + "=" * 50)
print("RANDOM SAMPLE")
print("=" * 50)
print(quick_sample(df, n=3, seed=42))
# 5. Uniqueness analysis
print("\n" + "=" * 50)
print("UNIQUENESS REPORT")
print("=" * 50)
print(unique_report(df))
# 6. Correlation analysis
print("\n" + "=" * 50)
print("TOP CORRELATIONS")
print("=" * 50)
print(corr_overview(df, top_n=3))
# 7. Analyze a NumPy array
print("\n" + "=" * 50)
print("ARRAY SUMMARY")
print("=" * 50)
arr = np.array([10, 20, 30, np.nan, 50])
print(array_summary(arr))
Key Utilities¶
- Quick Analysis: Get comprehensive data insights in a few function calls
- Missing Data Detection: Identify and report problematic columns with missing values
- Outlier Handling: Optional quantile clipping for robust statistics
- Correlation Analysis: Easily identify the strongest relationships between numeric features
- Reproducible Sampling: Seed control for consistent sampling across runs
- NumPy Support: Dedicated functions for analyzing arrays alongside DataFrames
Use Cases¶
- Data Exploration: Start EDA with
summary()for a quick overview - Data Quality Checks: Use
missing_report()to identify data quality issues - Feature Engineering: Use
corr_overview()to find multicollinear features - Data Validation: Combine
numeric_profile()andquick_sample()for data validation pipelines - Reporting: Generate summary statistics for data documentation and reports