Analysis 16: Comprehensive Vendor-Specific CWE Analysis Tables - ALL VENDORS¶
Individual Vendor Tables (1-4):¶
- Microsoft vs All CVEs - Shows Microsoft's patch priorities compared to overall CVE landscape
- Cisco vs All CVEs - Cisco's security focus areas vs industry trends
- RedHat vs All CVEs - RedHat's vulnerability priorities vs general patterns
- GitHub vs All CVEs - Open-source vulnerability patterns vs overall ecosystem
Comparative Analysis Tables (5-6):¶
Open-Source vs Commercial vs All CVEs - Strategic comparison between:
Open-Source: GitHub ecosystem priorities
Commercial: Combined Microsoft + Cisco + RedHat priorities
All CVEs: Industry-wide vulnerability landscape
Vendor Summary Table - High-level comparison showing:¶
- Total patches per vendor
- Unique CWE coverage
- Top vulnerability focus area
- Coverage percentage of overall CWE landscape
🎯 Key Insights Revealed:¶
- Vendor-Specific Security Priorities: Each vendor's unique focus areas
- Open-Source vs Commercial Differences: Different security challenges and approaches
- Industry Coverage Gaps: CWEs that are over/under-represented in patches
- Strategic Security Alignment: How well vendor priorities match real-world threats
Each table includes CWE names, rankings, counts, and comparative analysis to provide comprehensive insights into vendor-specific vulnerability management strategies.
1. Environment Setup and Data Loading¶
In [ ]:
Copied!
import duckdb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import os
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')
from matplotlib.patches import Patch
import matplotlib.patches as mpatches
from scipy import stats
# Try to use Modin for faster pandas operations
try:
import modin.pandas as mpd
USE_MODIN = True
print("Using Modin for accelerated pandas operations")
except ImportError:
import pandas as mpd
USE_MODIN = False
print("Using standard pandas (Modin not available)")
# Set up high-quality plotting parameters
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['savefig.format'] = 'eps'
plt.rcParams['font.size'] = 12
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['legend.fontsize'] = 10
# Global analysis period settings
ANALYSIS_END_DATE = "2024-12-31"
ANALYSIS_START_DATE = "1999-01-01" # Set to None for all data
USE_ALL_DATA = True # Toggle this to switch between full dataset and filtered
# Create output directory for figures
os.makedirs('figures', exist_ok=True)
os.makedirs('parquet_data', exist_ok=True)
print(f"Analysis Period: {'All available data' if USE_ALL_DATA else f'{ANALYSIS_START_DATE} to {ANALYSIS_END_DATE}'}")
import duckdb import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns import json import os from datetime import datetime, timedelta import warnings warnings.filterwarnings('ignore') from matplotlib.patches import Patch import matplotlib.patches as mpatches from scipy import stats # Try to use Modin for faster pandas operations try: import modin.pandas as mpd USE_MODIN = True print("Using Modin for accelerated pandas operations") except ImportError: import pandas as mpd USE_MODIN = False print("Using standard pandas (Modin not available)") # Set up high-quality plotting parameters plt.rcParams['figure.dpi'] = 300 plt.rcParams['savefig.dpi'] = 300 plt.rcParams['savefig.format'] = 'eps' plt.rcParams['font.size'] = 12 plt.rcParams['axes.titlesize'] = 14 plt.rcParams['axes.labelsize'] = 12 plt.rcParams['xtick.labelsize'] = 10 plt.rcParams['ytick.labelsize'] = 10 plt.rcParams['legend.fontsize'] = 10 # Global analysis period settings ANALYSIS_END_DATE = "2024-12-31" ANALYSIS_START_DATE = "1999-01-01" # Set to None for all data USE_ALL_DATA = True # Toggle this to switch between full dataset and filtered # Create output directory for figures os.makedirs('figures', exist_ok=True) os.makedirs('parquet_data', exist_ok=True) print(f"Analysis Period: {'All available data' if USE_ALL_DATA else f'{ANALYSIS_START_DATE} to {ANALYSIS_END_DATE}'}")
3. Load Parquet Data for Analysis¶
In [ ]:
Copied!
def load_parquet_data():
"""
Load Parquet files into DuckDB for analysis
"""
# Create a new connection for analysis
con = duckdb.connect(':memory:') # Use in-memory database for faster processing
# Load all parquet files
parquet_files = {
# MySQL tables
'cve_main': '\..\parquet_data\mysql_cve.parquet',
'cve_main_old': '\..\parquet_data\mysql_cvev5_v2.parquet',
'exploits': '\..\parquet_data\mysql_exploit.parquet',
'exploits_old': '\..\parquet_data\mysql_exploit_old.parquet',
'msrc_patches': '\..\parquet_data\mysql_msrc_vuln_unified.parquet',
'cisco_patches': '\..\parquet_data\mysql_cisco_vuln_unified.parquet',
'redhat_patches': '\..\parquet_data\mysql_redhat_vuln_unified.parquet',
'github_advisories': '\..\parquet_data\mysql_github_advisory_unified.parquet',
'cwe_ref': '\..\parquet_data\mysql_cwe.parquet',
'capec_ref': '\..\parquet_data\mysql_capec.parquet',
# PostgreSQL tables (MoreFixes)
'morefixes_cve': '\..\parquet_data\postgres_cve.parquet',
'morefixes_fixes': '\..\parquet_data\postgres_fixes.parquet',
'morefixes_commits': '\..\parquet_data\postgres_commits.parquet',
'morefixes_repository': '\..\parquet_data\postgres_repository.parquet'
}
# Create views for each parquet file
for table_name, file_path in parquet_files.items():
if os.path.exists(file_path):
con.sql(f"CREATE OR REPLACE VIEW {table_name} AS SELECT * FROM '{file_path}'")
print(f"✓ Loaded {table_name}")
else:
print(f"✗ File not found: {file_path}")
return con
# Load data for analysis
print("Loading Parquet data for analysis...")
analysis_con = load_parquet_data()
def load_parquet_data(): """ Load Parquet files into DuckDB for analysis """ # Create a new connection for analysis con = duckdb.connect(':memory:') # Use in-memory database for faster processing # Load all parquet files parquet_files = { # MySQL tables 'cve_main': '\..\parquet_data\mysql_cve.parquet', 'cve_main_old': '\..\parquet_data\mysql_cvev5_v2.parquet', 'exploits': '\..\parquet_data\mysql_exploit.parquet', 'exploits_old': '\..\parquet_data\mysql_exploit_old.parquet', 'msrc_patches': '\..\parquet_data\mysql_msrc_vuln_unified.parquet', 'cisco_patches': '\..\parquet_data\mysql_cisco_vuln_unified.parquet', 'redhat_patches': '\..\parquet_data\mysql_redhat_vuln_unified.parquet', 'github_advisories': '\..\parquet_data\mysql_github_advisory_unified.parquet', 'cwe_ref': '\..\parquet_data\mysql_cwe.parquet', 'capec_ref': '\..\parquet_data\mysql_capec.parquet', # PostgreSQL tables (MoreFixes) 'morefixes_cve': '\..\parquet_data\postgres_cve.parquet', 'morefixes_fixes': '\..\parquet_data\postgres_fixes.parquet', 'morefixes_commits': '\..\parquet_data\postgres_commits.parquet', 'morefixes_repository': '\..\parquet_data\postgres_repository.parquet' } # Create views for each parquet file for table_name, file_path in parquet_files.items(): if os.path.exists(file_path): con.sql(f"CREATE OR REPLACE VIEW {table_name} AS SELECT * FROM '{file_path}'") print(f"✓ Loaded {table_name}") else: print(f"✗ File not found: {file_path}") return con # Load data for analysis print("Loading Parquet data for analysis...") analysis_con = load_parquet_data()
Loading Parquet data for analysis... ✓ Loaded cve_main ✓ Loaded cve_main_old ✓ Loaded exploits ✓ Loaded exploits_old ✓ Loaded msrc_patches ✓ Loaded cisco_patches ✓ Loaded redhat_patches ✓ Loaded github_advisories ✓ Loaded cwe_ref ✓ Loaded capec_ref ✓ Loaded morefixes_cve ✓ Loaded morefixes_fixes ✓ Loaded morefixes_commits ✓ Loaded morefixes_repository
In [ ]:
Copied!
# List of all table names I've loaded
table_names = [
"cve_main", "cve_main_old", "exploits", "msrc_patches", "cisco_patches",
"redhat_patches", "github_advisories", "cwe_ref", "capec_ref",
"morefixes_cve", "morefixes_fixes", "morefixes_commits", "morefixes_repository"
]
print("\n--- Schema for all loaded tables ---")
for table_name in table_names:
print(f"\nSchema for table: {table_name}")
try:
# Execute PRAGMA table_info() to get schema
schema_info = analysis_con.execute(f"PRAGMA table_info('{table_name}');").fetchall()
if not schema_info:
print(f" (Table '{table_name}' not found or is empty)")
continue
# Print header
header = ["cid", "name", "type", "notnull", "pk", "dflt_value"]
print(f" {' '.join(f'{col:<15}' for col in header)}")
print(f" {'-'*90}")
# Print rows
for col_info in schema_info:
cid, name, col_type, notnull, pk, dflt_value = col_info
print(f" {cid:<15} {name:<15} {col_type:<15} {str(notnull):<15} {str(pk):<15} {str(dflt_value):<15}")
except duckdb.ParserException as e:
print(f" Error retrieving schema for {table_name}: {e}")
except Exception as e:
print(f" An unexpected error occurred for {table_name}: {e}")
# List of all table names I've loaded table_names = [ "cve_main", "cve_main_old", "exploits", "msrc_patches", "cisco_patches", "redhat_patches", "github_advisories", "cwe_ref", "capec_ref", "morefixes_cve", "morefixes_fixes", "morefixes_commits", "morefixes_repository" ] print("\n--- Schema for all loaded tables ---") for table_name in table_names: print(f"\nSchema for table: {table_name}") try: # Execute PRAGMA table_info() to get schema schema_info = analysis_con.execute(f"PRAGMA table_info('{table_name}');").fetchall() if not schema_info: print(f" (Table '{table_name}' not found or is empty)") continue # Print header header = ["cid", "name", "type", "notnull", "pk", "dflt_value"] print(f" {' '.join(f'{col:<15}' for col in header)}") print(f" {'-'*90}") # Print rows for col_info in schema_info: cid, name, col_type, notnull, pk, dflt_value = col_info print(f" {cid:<15} {name:<15} {col_type:<15} {str(notnull):<15} {str(pk):<15} {str(dflt_value):<15}") except duckdb.ParserException as e: print(f" Error retrieving schema for {table_name}: {e}") except Exception as e: print(f" An unexpected error occurred for {table_name}: {e}")
--- Schema for all loaded tables --- Schema for table: cve_main cid name type notnull pk dflt_value ------------------------------------------------------------------------------------------ 0 id BIGINT False None False 1 cve_id VARCHAR False None False 2 assigner_org VARCHAR False None False 3 state VARCHAR False None False 4 description VARCHAR False None False 5 date_reserved TIMESTAMP False None False 6 date_published TIMESTAMP False None False 7 date_updated TIMESTAMP False None False 8 cvss_v2_score FLOAT False None False 9 cvss_v2_vector VARCHAR False None False 10 cvss_v3_score FLOAT False None False 11 cvss_v3_vector VARCHAR False None False 12 cvss_v3_severity VARCHAR False None False 13 cvss_v4_score FLOAT False None False 14 cvss_v4_vector VARCHAR False None False 15 cvss_v4_severity VARCHAR False None False 16 cwe_ids VARCHAR False None False 17 cpes VARCHAR False None False 18 vendors VARCHAR False None False 19 products VARCHAR False None False 20 references VARCHAR False None False 21 ssvc_exploitation VARCHAR False None False 22 ssvc_automatable VARCHAR False None False 23 ssvc_technical_impact VARCHAR False None False 24 kev_known_exploited TINYINT False None False 25 kev_vendor_project VARCHAR False None False 26 kev_product VARCHAR False None False 27 kev_vulnerability_name VARCHAR False None False 28 kev_date_added TIMESTAMP False None False 29 kev_short_description VARCHAR False None False 30 kev_required_action VARCHAR False None False 31 kev_due_date TIMESTAMP False None False 32 kev_ransomware_use VARCHAR False None False 33 kev_notes VARCHAR False None False 34 kev_cwes VARCHAR False None False 35 epss_score FLOAT False None False 36 epss_percentile FLOAT False None False 37 data_sources VARCHAR False None False 38 created_at TIMESTAMP WITH TIME ZONE False None False 39 updated_at TIMESTAMP WITH TIME ZONE False None False 40 has_exploit TINYINT False None False 41 exploit_count INTEGER False None False 42 first_exploit_date TIMESTAMP False None False 43 latest_exploit_date TIMESTAMP False None False Schema for table: cve_main_old cid name type notnull pk dflt_value ------------------------------------------------------------------------------------------ 0 id BIGINT False None False 1 CVE ID VARCHAR False None False 2 State VARCHAR False None False 3 Date Published TIMESTAMP False None False 4 Date Updated TIMESTAMP False None False 5 Date Reserved TIMESTAMP False None False 6 Descriptions VARCHAR False None False 7 Affected Products VARCHAR False None False 8 References VARCHAR False None False 9 Problem Types VARCHAR False None False 10 Base Severity VARCHAR False None False 11 Confidentiality Impact VARCHAR False None False 12 Integrity Impact VARCHAR False None False 13 Availability Impact VARCHAR False None False 14 CVSS 2.0 Base Score FLOAT False None False 15 CVSS 3.0 Base Score FLOAT False None False 16 CVSS 3.1 Base Score FLOAT False None False 17 cwe VARCHAR False None False 18 EPSS FLOAT False None False 19 vendors VARCHAR False None False 20 Software CPES VARCHAR False None False 21 V Score FLOAT False None False Schema for table: exploits cid name type notnull pk dflt_value ------------------------------------------------------------------------------------------ 0 id BIGINT False None False 1 file VARCHAR False None False 2 description VARCHAR False None False 3 date_published TIMESTAMP False None False 4 author VARCHAR False None False 5 type VARCHAR False None False 6 platform VARCHAR False None False 7 port DOUBLE False None False 8 date_added TIMESTAMP False None False 9 date_updated TIMESTAMP False None False 10 verified BIGINT False None False 11 codes VARCHAR False None False 12 tags VARCHAR False None False 13 aliases VARCHAR False None False 14 screenshot_url VARCHAR False None False 15 application_url VARCHAR False None False 16 source_url VARCHAR False None False 17 cve_id VARCHAR False None False Schema for table: msrc_patches cid name type notnull pk dflt_value ------------------------------------------------------------------------------------------ 0 title VARCHAR False None False 1 release_date TIMESTAMP False None False 2 initial_release_date TIMESTAMP False None False 3 cvrf_id VARCHAR False None False 4 cve_id VARCHAR False None False 5 exploited_status INTEGER False None False 6 exploitation_potential_lsr INTEGER False None False 7 exploitation_potential_osr INTEGER False None False 8 publicly_disclosed INTEGER False None False 9 cvss_score FLOAT False None False 10 cvss_vector VARCHAR False None False 11 vuln_title VARCHAR False None False 12 product_id VARCHAR False None False 13 product_name VARCHAR False None False 14 product_branch VARCHAR False None False 15 product_cpe VARCHAR False None False 16 threats VARCHAR False None False 17 remediations VARCHAR False None False 18 cwe_ids VARCHAR False None False 19 notes VARCHAR False None False 20 acknowledgments VARCHAR False None False Schema for table: cisco_patches cid name type notnull pk dflt_value ------------------------------------------------------------------------------------------ 0 advisory_id VARCHAR False None False 1 title VARCHAR False None False 2 cve_id VARCHAR False None False 3 vulnerability_title VARCHAR False None False 4 current_release_date TIMESTAMP False None False 5 initial_release_date TIMESTAMP False None False 6 vulnerability_release_date TIMESTAMP False None False 7 status VARCHAR False None False 8 version VARCHAR False None False 9 publisher VARCHAR False None False 10 publisher_category VARCHAR False None False 11 summary VARCHAR False None False 12 details VARCHAR False None False 13 cvss_score FLOAT False None False 14 cvss_severity VARCHAR False None False 15 cvss_vector VARCHAR False None False 16 bug_ids VARCHAR False None False 17 product_id VARCHAR False None False 18 product_name VARCHAR False None False 19 product_full_path VARCHAR False None False 20 acknowledgments VARCHAR False None False 21 references VARCHAR False None False 22 remediations VARCHAR False None False Schema for table: redhat_patches cid name type notnull pk dflt_value ------------------------------------------------------------------------------------------ 0 id BIGINT False None False 1 advisory_id VARCHAR False None False 2 title VARCHAR False None False 3 cve_id VARCHAR False None False 4 cwe_id VARCHAR False None False 5 vulnerability_title VARCHAR False None False 6 current_release_date TIMESTAMP False None False 7 initial_release_date TIMESTAMP False None False 8 discovery_date TIMESTAMP False None False 9 release_date TIMESTAMP False None False 10 status VARCHAR False None False 11 version VARCHAR False None False 12 publisher VARCHAR False None False 13 publisher_category VARCHAR False None False 14 summary VARCHAR False None False 15 details VARCHAR False None False 16 cvss_score FLOAT False None False 17 cvss_severity VARCHAR False None False 18 cvss_vector VARCHAR False None False 19 threat_impact VARCHAR False None False 20 aggregate_severity VARCHAR False None False 21 product_id VARCHAR False None False 22 product_name VARCHAR False None False Schema for table: github_advisories cid name type notnull pk dflt_value ------------------------------------------------------------------------------------------ 0 id BIGINT False None False 1 ghsa_id VARCHAR False None False 2 schema_version VARCHAR False None False 3 published TIMESTAMP False None False 4 modified TIMESTAMP False None False 5 summary VARCHAR False None False 6 details VARCHAR False None False 7 primary_cve VARCHAR False None False 8 all_cves VARCHAR False None False 9 cvss_v3_score FLOAT False None False 10 cvss_v3_vector VARCHAR False None False 11 cvss_v4_score FLOAT False None False 12 cvss_v4_vector VARCHAR False None False 13 database_severity VARCHAR False None False 14 severity_score FLOAT False None False 15 cwe_ids VARCHAR False None False 16 github_reviewed BOOLEAN False None False 17 github_reviewed_at TIMESTAMP False None False 18 nvd_published_at TIMESTAMP False None False 19 exploited TINYINT False None False 20 exploitability_level TINYINT False None False 21 poc_available TINYINT False None False 22 patched TINYINT False None False 23 patch_available TINYINT False None False 24 primary_ecosystem VARCHAR False None False 25 all_ecosystems VARCHAR False None False 26 package_ecosystem VARCHAR False None False 27 package_name VARCHAR False None False 28 package_purl VARCHAR False None False 29 references VARCHAR False None False 30 affected_ranges VARCHAR False None False 31 affected_versions VARCHAR False None False 32 created_at TIMESTAMP WITH TIME ZONE False None False 33 updated_at TIMESTAMP WITH TIME ZONE False None False Schema for table: cwe_ref cid name type notnull pk dflt_value ------------------------------------------------------------------------------------------ 0 cwe_id VARCHAR False None False 1 name VARCHAR False None False 2 weakness_abstraction VARCHAR False None False 3 status VARCHAR False None False 4 description VARCHAR False None False 5 extended_description VARCHAR False None False 6 related_weaknesses VARCHAR False None False 7 weakness_ordinalities VARCHAR False None False 8 applicable_platforms VARCHAR False None False 9 background_details VARCHAR False None False 10 alternate_terms VARCHAR False None False 11 modes_of_introduction VARCHAR False None False 12 exploitation_factors VARCHAR False None False 13 likelihood_of_exploit VARCHAR False None False 14 common_consequences VARCHAR False None False 15 detection_methods VARCHAR False None False 16 potential_mitigations VARCHAR False None False 17 observed_examples VARCHAR False None False 18 functional_areas VARCHAR False None False 19 affected_resources VARCHAR False None False 20 taxonomy_mappings VARCHAR False None False 21 related_attack_patterns VARCHAR False None False 22 notes VARCHAR False None False 23 created_at TIMESTAMP WITH TIME ZONE False None False Schema for table: capec_ref cid name type notnull pk dflt_value ------------------------------------------------------------------------------------------ 0 capec_id VARCHAR False None False 1 name VARCHAR False None False 2 abstraction VARCHAR False None False 3 status VARCHAR False None False 4 description VARCHAR False None False 5 alternate_terms VARCHAR False None False 6 likelihood_of_attack VARCHAR False None False 7 typical_severity VARCHAR False None False 8 related_attack_patterns VARCHAR False None False 9 execution_flow VARCHAR False None False 10 prerequisites VARCHAR False None False 11 skills_required VARCHAR False None False 12 resources_required VARCHAR False None False 13 indicators VARCHAR False None False 14 consequences VARCHAR False None False 15 mitigations VARCHAR False None False 16 example_instances VARCHAR False None False 17 related_weaknesses VARCHAR False None False 18 taxonomy_mappings VARCHAR False None False 19 notes VARCHAR False None False 20 created_at TIMESTAMP WITH TIME ZONE False None False Schema for table: morefixes_cve cid name type notnull pk dflt_value ------------------------------------------------------------------------------------------ 0 cve_id VARCHAR False None False 1 published_date VARCHAR False None False 2 last_modified_date VARCHAR False None False 3 description VARCHAR False None False 4 nodes VARCHAR False None False 5 severity VARCHAR False None False 6 obtain_all_privilege VARCHAR False None False 7 obtain_user_privilege VARCHAR False None False 8 obtain_other_privilege VARCHAR False None False 9 user_interaction_required VARCHAR False None False 10 cvss2_vector_string VARCHAR False None False 11 cvss2_access_vector VARCHAR False None False 12 cvss2_access_complexity VARCHAR False None False 13 cvss2_authentication VARCHAR False None False 14 cvss2_confidentiality_impact VARCHAR False None False 15 cvss2_integrity_impact VARCHAR False None False 16 cvss2_availability_impact VARCHAR False None False 17 cvss2_base_score VARCHAR False None False 18 cvss3_vector_string VARCHAR False None False 19 cvss3_attack_vector VARCHAR False None False 20 cvss3_attack_complexity VARCHAR False None False 21 cvss3_privileges_required VARCHAR False None False 22 cvss3_user_interaction VARCHAR False None False 23 cvss3_scope VARCHAR False None False 24 cvss3_confidentiality_impact VARCHAR False None False 25 cvss3_integrity_impact VARCHAR False None False 26 cvss3_availability_impact VARCHAR False None False 27 cvss3_base_score VARCHAR False None False 28 cvss3_base_severity VARCHAR False None False 29 exploitability_score VARCHAR False None False 30 impact_score VARCHAR False None False 31 ac_insuf_info VARCHAR False None False 32 reference_json VARCHAR False None False 33 problemtype_json VARCHAR False None False Schema for table: morefixes_fixes cid name type notnull pk dflt_value ------------------------------------------------------------------------------------------ 0 cve_id VARCHAR False None False 1 hash VARCHAR False None False 2 repo_url VARCHAR False None False 3 rel_type VARCHAR False None False 4 score BIGINT False None False 5 extraction_status VARCHAR False None False Schema for table: morefixes_commits cid name type notnull pk dflt_value ------------------------------------------------------------------------------------------ 0 hash VARCHAR False None False 1 repo_url VARCHAR False None False 2 author VARCHAR False None False 3 committer VARCHAR False None False 4 msg VARCHAR False None False 5 parents VARCHAR False None False 6 author_timezone BIGINT False None False 7 num_lines_added BIGINT False None False 8 num_lines_deleted BIGINT False None False 9 dmm_unit_complexity DOUBLE False None False 10 dmm_unit_interfacing DOUBLE False None False 11 dmm_unit_size DOUBLE False None False 12 merge BOOLEAN False None False 13 committer_timezone BIGINT False None False 14 author_date TIMESTAMP WITH TIME ZONE False None False 15 committer_date TIMESTAMP WITH TIME ZONE False None False Schema for table: morefixes_repository cid name type notnull pk dflt_value ------------------------------------------------------------------------------------------ 0 repo_url VARCHAR False None False 1 repo_name VARCHAR False None False 2 description VARCHAR False None False 3 date_created TIMESTAMP False None False 4 date_last_push TIMESTAMP False None False 5 homepage VARCHAR False None False 6 repo_language VARCHAR False None False 7 owner VARCHAR False None False 8 forks_count BIGINT False None False 9 stars_count BIGINT False None False
In [ ]:
Copied!
def create_vendor_cwe_analysis():
"""
Create 6 comprehensive tables:
1. Microsoft vs All CVEs
2. Cisco vs All CVEs
3. RedHat vs All CVEs
4. GitHub vs All CVEs
5. Open-Source (GitHub) vs Commercial (MS+Cisco+RH) vs All CVEs
6. Summary comparison table
analysis addressing the "-" values and coverage calculation issues
"""
print("\n=== Vendor-Specific CWE Analysis ===")
# Get ALL CWEs ranking (not limited to 25) - with proper exclusions
all_cwe_base_query = """
WITH all_cve_cwe_split AS (
SELECT DISTINCT
cve_id,
TRIM(UNNEST(STRING_SPLIT(cwe_ids, ','))) as cwe_id
FROM cve_main
WHERE cwe_ids IS NOT NULL
AND cwe_ids != ''
AND cwe_ids != 'NVD-CWE-Other'
AND cwe_ids != 'NVD-CWE-noinfo'
AND state = 'PUBLISHED'
)
SELECT
cwe_id,
COUNT(DISTINCT cve_id) as all_count,
ROW_NUMBER() OVER (ORDER BY COUNT(DISTINCT cve_id) DESC) as all_rank
FROM all_cve_cwe_split
WHERE cwe_id IS NOT NULL
AND cwe_id != ''
AND cwe_id != 'NVD-CWE-Other'
AND cwe_id != 'NVD-CWE-noinfo'
GROUP BY cwe_id
ORDER BY COUNT(DISTINCT cve_id) DESC
"""
all_cve_ranking = analysis_con.sql(all_cwe_base_query).df()
total_unique_cwes = len(all_cve_ranking)
print(f"Total unique CWEs in database: {total_unique_cwes:,}")
# Microsoft Analysis - Get ALL CWEs (not limited) - with proper exclusions
microsoft_query = """
WITH ms_cwe_split AS (
SELECT DISTINCT
mp.cve_id,
TRIM(UNNEST(STRING_SPLIT(cm.cwe_ids, ','))) as cwe_id
FROM msrc_patches mp
JOIN cve_main cm ON mp.cve_id = cm.cve_id
WHERE cm.cwe_ids IS NOT NULL
AND cm.cwe_ids != ''
AND cm.cwe_ids != 'NVD-CWE-Other'
AND cm.cwe_ids != 'NVD-CWE-noinfo'
)
SELECT
cwe_id,
COUNT(DISTINCT cve_id) as ms_count,
ROW_NUMBER() OVER (ORDER BY COUNT(DISTINCT cve_id) DESC) as ms_rank
FROM ms_cwe_split
WHERE cwe_id IS NOT NULL
AND cwe_id != ''
AND cwe_id != 'NVD-CWE-Other'
AND cwe_id != 'NVD-CWE-noinfo'
GROUP BY cwe_id
ORDER BY COUNT(DISTINCT cve_id) DESC
"""
microsoft_all = analysis_con.sql(microsoft_query).df()
# Similar queries for other vendors (getting ALL CWEs, not limited) - with proper exclusions
cisco_query = """
WITH cisco_cwe_split AS (
SELECT DISTINCT
cp.cve_id,
TRIM(UNNEST(STRING_SPLIT(cm.cwe_ids, ','))) as cwe_id
FROM cisco_patches cp
JOIN cve_main cm ON cp.cve_id = cm.cve_id
WHERE cm.cwe_ids IS NOT NULL
AND cm.cwe_ids != ''
AND cm.cwe_ids != 'NVD-CWE-Other'
AND cm.cwe_ids != 'NVD-CWE-noinfo'
)
SELECT
cwe_id,
COUNT(DISTINCT cve_id) as cisco_count,
ROW_NUMBER() OVER (ORDER BY COUNT(DISTINCT cve_id) DESC) as cisco_rank
FROM cisco_cwe_split
WHERE cwe_id IS NOT NULL
AND cwe_id != ''
AND cwe_id != 'NVD-CWE-Other'
AND cwe_id != 'NVD-CWE-noinfo'
GROUP BY cwe_id
ORDER BY COUNT(DISTINCT cve_id) DESC
"""
cisco_all = analysis_con.sql(cisco_query).df()
# GitHub Analysis - with proper exclusions
github_query = """
WITH gh_cwe_split AS (
SELECT DISTINCT
ga.primary_cve as cve_id,
TRIM(UNNEST(STRING_SPLIT(cm.cwe_ids, ','))) as cwe_id
FROM github_advisories ga
JOIN cve_main cm ON ga.primary_cve = cm.cve_id
WHERE cm.cwe_ids IS NOT NULL
AND cm.cwe_ids != ''
AND cm.cwe_ids != 'NVD-CWE-Other'
AND cm.cwe_ids != 'NVD-CWE-noinfo'
AND ga.patched = 1
)
SELECT
cwe_id,
COUNT(DISTINCT cve_id) as gh_count,
ROW_NUMBER() OVER (ORDER BY COUNT(DISTINCT cve_id) DESC) as gh_rank
FROM gh_cwe_split
WHERE cwe_id IS NOT NULL
AND cwe_id != ''
AND cwe_id != 'NVD-CWE-Other'
AND cwe_id != 'NVD-CWE-noinfo'
GROUP BY cwe_id
ORDER BY COUNT(DISTINCT cve_id) DESC
"""
github_all = analysis_con.sql(github_query).df()
# RedHat Commercial - with proper exclusions
redhat_commercial_query = """
WITH rh_comm_cwe_split AS (
SELECT DISTINCT
rp.cve_id,
TRIM(UNNEST(STRING_SPLIT(cm.cwe_ids, ','))) as cwe_id
FROM redhat_patches rp
JOIN cve_main cm ON rp.cve_id = cm.cve_id
WHERE cm.cwe_ids IS NOT NULL
AND cm.cwe_ids != ''
AND cm.cwe_ids != 'NVD-CWE-Other'
AND cm.cwe_ids != 'NVD-CWE-noinfo'
AND rp.product_name IS NOT NULL
AND rp.product_name != ''
)
SELECT
cwe_id,
COUNT(DISTINCT cve_id) as rh_comm_count,
ROW_NUMBER() OVER (ORDER BY COUNT(DISTINCT cve_id) DESC) as rh_comm_rank
FROM rh_comm_cwe_split
WHERE cwe_id IS NOT NULL
AND cwe_id != ''
AND cwe_id != 'NVD-CWE-Other'
AND cwe_id != 'NVD-CWE-noinfo'
GROUP BY cwe_id
ORDER BY COUNT(DISTINCT cve_id) DESC
"""
redhat_commercial_all = analysis_con.sql(redhat_commercial_query).df()
# RedHat Open-Source - with proper exclusions
redhat_opensource_query = """
WITH rh_os_cwe_split AS (
SELECT DISTINCT
rp.cve_id,
TRIM(UNNEST(STRING_SPLIT(cm.cwe_ids, ','))) as cwe_id
FROM redhat_patches rp
JOIN cve_main cm ON rp.cve_id = cm.cve_id
WHERE cm.cwe_ids IS NOT NULL
AND cm.cwe_ids != ''
AND cm.cwe_ids != 'NVD-CWE-Other'
AND cm.cwe_ids != 'NVD-CWE-noinfo'
AND (rp.product_name IS NULL OR rp.product_name = '')
)
SELECT
cwe_id,
COUNT(DISTINCT cve_id) as rh_os_count,
ROW_NUMBER() OVER (ORDER BY COUNT(DISTINCT cve_id) DESC) as rh_os_rank
FROM rh_os_cwe_split
WHERE cwe_id IS NOT NULL
AND cwe_id != ''
AND cwe_id != 'NVD-CWE-Other'
AND cwe_id != 'NVD-CWE-noinfo'
GROUP BY cwe_id
ORDER BY COUNT(DISTINCT cve_id) DESC
"""
redhat_opensource_all = analysis_con.sql(redhat_opensource_query).df()
# Function to merge with ALL rankings and add CWE names
def merge_with_complete_rankings(vendor_data, vendor_name):
"""Merge vendor data with complete rankings"""
if vendor_data.empty:
return vendor_data
# Merge with all CVE rankings
merged = vendor_data.merge(
all_cve_ranking[['cwe_id', 'all_count', 'all_rank']],
on='cwe_id',
how='left'
)
# Add CWE names
if len(merged) > 0:
cwe_list = "', '".join(merged['cwe_id'].astype(str).tolist())
cwe_names_query = f"""
SELECT cwe_id, name as cwe_name
FROM cwe_ref
WHERE cwe_id IN ('{cwe_list}')
"""
try:
cwe_names = analysis_con.sql(cwe_names_query).df()
merged = merged.merge(cwe_names, on='cwe_id', how='left')
except Exception as e:
print(f"Warning: Could not fetch CWE names: {e}")
merged['cwe_name'] = 'Unknown'
merged['cwe_name'] = merged['cwe_name'].fillna('Unknown')
return merged
# Merge all data
ms_complete = merge_with_complete_rankings(microsoft_all, 'Microsoft')
cisco_complete = merge_with_complete_rankings(cisco_all, 'Cisco')
rh_comm_complete = merge_with_complete_rankings(redhat_commercial_all, 'RedHat Commercial')
rh_os_complete = merge_with_complete_rankings(redhat_opensource_all, 'RedHat Open-Source')
gh_complete = merge_with_complete_rankings(github_all, 'GitHub')
# Function to print top CWEs with proper formatting
def print_vendor_comparison(data, title, vendor_col, count_col, rank_col, limit=15):
"""Print vendor comparison with proper '-' handling"""
print(f"\n{'='*140}")
print(f"{title}")
print('='*140)
print(f"{'CWE ID':<12} {'Name':<40} {f'{vendor_col} Rank':<12} {f'{vendor_col} Count':<15} {'Global Rank':<12} {'Global Count':<12}")
print('='*140)
for _, row in data.head(limit).iterrows():
cwe_name = str(row.get('cwe_name', 'Unknown'))
if len(cwe_name) > 38:
cwe_name = cwe_name[:37] + "..."
vendor_rank = str(int(row[rank_col])) if pd.notna(row[rank_col]) else '-'
vendor_count = f"{int(row[count_col]):,}" if pd.notna(row[count_col]) else '-'
# : Proper handling of global rank
if pd.notna(row['all_rank']):
global_rank = str(int(row['all_rank']))
else:
global_rank = f">{len(all_cve_ranking)}" # Show it's beyond top rankings
global_count = f"{int(row['all_count']):,}" if pd.notna(row['all_count']) else '-'
print(f"{row['cwe_id']:<12} {cwe_name:<40} {vendor_rank:<12} {vendor_count:<15} {global_rank:<12} {global_count:<12}")
# Print all vendor comparisons
print_vendor_comparison(ms_complete, "TABLE 1: Microsoft Patches vs All CVEs - ", "MS", "ms_count", "ms_rank")
print_vendor_comparison(cisco_complete, "TABLE 2: Cisco Patches vs All CVEs - ", "Cisco", "cisco_count", "cisco_rank")
print_vendor_comparison(rh_comm_complete, "TABLE 3: RedHat Commercial vs All CVEs - ", "RH-Comm", "rh_comm_count", "rh_comm_rank")
print_vendor_comparison(rh_os_complete, "TABLE 4: RedHat Open-Source vs All CVEs - ", "RH-OS", "rh_os_count", "rh_os_rank")
print_vendor_comparison(gh_complete, "TABLE 5: GitHub Open-Source vs All CVEs - ", "GH-OS", "gh_count", "gh_rank")
# Summary Table with Correct Coverage Calculations
print(f"\n{'='*160}")
print("TABLE 6: Vendor CWE Coverage Summary")
print('='*160)
print(f"{'Vendor':<20} {'Total CVEs':<12} {'Actual CWEs':<12} {'Top CWE':<12} {'Top Count':<12} {'Coverage %':<15} {'Specialization':<20}")
print('='*160)
vendors_summary = [
('Microsoft', microsoft_all, 'ms_count'),
('Cisco', cisco_all, 'cisco_count'),
('RedHat Commercial', redhat_commercial_all, 'rh_comm_count'),
('RedHat Open-Source', redhat_opensource_all, 'rh_os_count'),
('GitHub Open-Source', github_all, 'gh_count')
]
for vendor_name, vendor_data, count_col in vendors_summary:
if not vendor_data.empty:
total_cves = vendor_data[count_col].sum()
actual_unique_cwes = len(vendor_data) # ACTUAL number of unique CWEs
top_cwe = vendor_data.iloc[0]['cwe_id']
top_count = vendor_data.iloc[0][count_col]
# Coverage Calculation
coverage_pct = (actual_unique_cwes / total_unique_cwes) * 100
# Calculate specialization (what % of their CVEs are in top CWE)
specialization_pct = (top_count / total_cves) * 100
print(f"{vendor_name:<20} {total_cves:<12,} {actual_unique_cwes:<12} {top_cwe:<12} "
f"{top_count:<12,} {coverage_pct:<15.1f}% {specialization_pct:<20.1f}%")
# Explanation of the changes
print(f"\n{'='*160}")
print("EXPLANATION OF FIXES:")
print('='*160)
print(f"1. \\\"Total unique CWEs in database: {total_unique_cwes:,}\\\"")
print("2. Coverage % = (Vendor's Unique CWEs / Total CWEs in DB) × 100")
print("3. Specialization % = (Top CWE Count / Total CVEs) × 100")
print("4. Global Rank > X means the CWE is not in the top X globally")
print("5. Removed artificial LIMIT 20 to show actual coverage")
# Analysis of vendor specializations
print("\n" + "="*100)
print("VENDOR SPECIALIZATION ANALYSIS:")
print("="*100)
print("\nMicrosoft Specializations (not in global top 25):")
ms_special = ms_complete[pd.isna(ms_complete['all_rank']) | (ms_complete['all_rank'] > 25)].head(5)
for _, row in ms_special.iterrows():
print(f" {row['cwe_id']}: {row['cwe_name']} ({int(row['ms_count'])} CVEs)")
print("\nCisco Specializations (not in global top 25):")
cisco_special = cisco_complete[pd.isna(cisco_complete['all_rank']) | (cisco_complete['all_rank'] > 25)].head(5)
for _, row in cisco_special.iterrows():
print(f" {row['cwe_id']}: {row['cwe_name']} ({int(row['cisco_count'])} CVEs)")
return {
'microsoft': ms_complete,
'cisco': cisco_complete,
'redhat_commercial': rh_comm_complete,
'redhat_opensource': rh_os_complete,
'github': gh_complete,
'all_rankings': all_cve_ranking,
'total_cwes': total_unique_cwes
}
# Execute Analysis
print("Starting Vendor-Specific CWE Analysis...")
fixed_results = create_vendor_cwe_analysis()
print("\n=== ANALYSIS COMPLETE ===")
def create_vendor_cwe_analysis(): """ Create 6 comprehensive tables: 1. Microsoft vs All CVEs 2. Cisco vs All CVEs 3. RedHat vs All CVEs 4. GitHub vs All CVEs 5. Open-Source (GitHub) vs Commercial (MS+Cisco+RH) vs All CVEs 6. Summary comparison table analysis addressing the "-" values and coverage calculation issues """ print("\n=== Vendor-Specific CWE Analysis ===") # Get ALL CWEs ranking (not limited to 25) - with proper exclusions all_cwe_base_query = """ WITH all_cve_cwe_split AS ( SELECT DISTINCT cve_id, TRIM(UNNEST(STRING_SPLIT(cwe_ids, ','))) as cwe_id FROM cve_main WHERE cwe_ids IS NOT NULL AND cwe_ids != '' AND cwe_ids != 'NVD-CWE-Other' AND cwe_ids != 'NVD-CWE-noinfo' AND state = 'PUBLISHED' ) SELECT cwe_id, COUNT(DISTINCT cve_id) as all_count, ROW_NUMBER() OVER (ORDER BY COUNT(DISTINCT cve_id) DESC) as all_rank FROM all_cve_cwe_split WHERE cwe_id IS NOT NULL AND cwe_id != '' AND cwe_id != 'NVD-CWE-Other' AND cwe_id != 'NVD-CWE-noinfo' GROUP BY cwe_id ORDER BY COUNT(DISTINCT cve_id) DESC """ all_cve_ranking = analysis_con.sql(all_cwe_base_query).df() total_unique_cwes = len(all_cve_ranking) print(f"Total unique CWEs in database: {total_unique_cwes:,}") # Microsoft Analysis - Get ALL CWEs (not limited) - with proper exclusions microsoft_query = """ WITH ms_cwe_split AS ( SELECT DISTINCT mp.cve_id, TRIM(UNNEST(STRING_SPLIT(cm.cwe_ids, ','))) as cwe_id FROM msrc_patches mp JOIN cve_main cm ON mp.cve_id = cm.cve_id WHERE cm.cwe_ids IS NOT NULL AND cm.cwe_ids != '' AND cm.cwe_ids != 'NVD-CWE-Other' AND cm.cwe_ids != 'NVD-CWE-noinfo' ) SELECT cwe_id, COUNT(DISTINCT cve_id) as ms_count, ROW_NUMBER() OVER (ORDER BY COUNT(DISTINCT cve_id) DESC) as ms_rank FROM ms_cwe_split WHERE cwe_id IS NOT NULL AND cwe_id != '' AND cwe_id != 'NVD-CWE-Other' AND cwe_id != 'NVD-CWE-noinfo' GROUP BY cwe_id ORDER BY COUNT(DISTINCT cve_id) DESC """ microsoft_all = analysis_con.sql(microsoft_query).df() # Similar queries for other vendors (getting ALL CWEs, not limited) - with proper exclusions cisco_query = """ WITH cisco_cwe_split AS ( SELECT DISTINCT cp.cve_id, TRIM(UNNEST(STRING_SPLIT(cm.cwe_ids, ','))) as cwe_id FROM cisco_patches cp JOIN cve_main cm ON cp.cve_id = cm.cve_id WHERE cm.cwe_ids IS NOT NULL AND cm.cwe_ids != '' AND cm.cwe_ids != 'NVD-CWE-Other' AND cm.cwe_ids != 'NVD-CWE-noinfo' ) SELECT cwe_id, COUNT(DISTINCT cve_id) as cisco_count, ROW_NUMBER() OVER (ORDER BY COUNT(DISTINCT cve_id) DESC) as cisco_rank FROM cisco_cwe_split WHERE cwe_id IS NOT NULL AND cwe_id != '' AND cwe_id != 'NVD-CWE-Other' AND cwe_id != 'NVD-CWE-noinfo' GROUP BY cwe_id ORDER BY COUNT(DISTINCT cve_id) DESC """ cisco_all = analysis_con.sql(cisco_query).df() # GitHub Analysis - with proper exclusions github_query = """ WITH gh_cwe_split AS ( SELECT DISTINCT ga.primary_cve as cve_id, TRIM(UNNEST(STRING_SPLIT(cm.cwe_ids, ','))) as cwe_id FROM github_advisories ga JOIN cve_main cm ON ga.primary_cve = cm.cve_id WHERE cm.cwe_ids IS NOT NULL AND cm.cwe_ids != '' AND cm.cwe_ids != 'NVD-CWE-Other' AND cm.cwe_ids != 'NVD-CWE-noinfo' AND ga.patched = 1 ) SELECT cwe_id, COUNT(DISTINCT cve_id) as gh_count, ROW_NUMBER() OVER (ORDER BY COUNT(DISTINCT cve_id) DESC) as gh_rank FROM gh_cwe_split WHERE cwe_id IS NOT NULL AND cwe_id != '' AND cwe_id != 'NVD-CWE-Other' AND cwe_id != 'NVD-CWE-noinfo' GROUP BY cwe_id ORDER BY COUNT(DISTINCT cve_id) DESC """ github_all = analysis_con.sql(github_query).df() # RedHat Commercial - with proper exclusions redhat_commercial_query = """ WITH rh_comm_cwe_split AS ( SELECT DISTINCT rp.cve_id, TRIM(UNNEST(STRING_SPLIT(cm.cwe_ids, ','))) as cwe_id FROM redhat_patches rp JOIN cve_main cm ON rp.cve_id = cm.cve_id WHERE cm.cwe_ids IS NOT NULL AND cm.cwe_ids != '' AND cm.cwe_ids != 'NVD-CWE-Other' AND cm.cwe_ids != 'NVD-CWE-noinfo' AND rp.product_name IS NOT NULL AND rp.product_name != '' ) SELECT cwe_id, COUNT(DISTINCT cve_id) as rh_comm_count, ROW_NUMBER() OVER (ORDER BY COUNT(DISTINCT cve_id) DESC) as rh_comm_rank FROM rh_comm_cwe_split WHERE cwe_id IS NOT NULL AND cwe_id != '' AND cwe_id != 'NVD-CWE-Other' AND cwe_id != 'NVD-CWE-noinfo' GROUP BY cwe_id ORDER BY COUNT(DISTINCT cve_id) DESC """ redhat_commercial_all = analysis_con.sql(redhat_commercial_query).df() # RedHat Open-Source - with proper exclusions redhat_opensource_query = """ WITH rh_os_cwe_split AS ( SELECT DISTINCT rp.cve_id, TRIM(UNNEST(STRING_SPLIT(cm.cwe_ids, ','))) as cwe_id FROM redhat_patches rp JOIN cve_main cm ON rp.cve_id = cm.cve_id WHERE cm.cwe_ids IS NOT NULL AND cm.cwe_ids != '' AND cm.cwe_ids != 'NVD-CWE-Other' AND cm.cwe_ids != 'NVD-CWE-noinfo' AND (rp.product_name IS NULL OR rp.product_name = '') ) SELECT cwe_id, COUNT(DISTINCT cve_id) as rh_os_count, ROW_NUMBER() OVER (ORDER BY COUNT(DISTINCT cve_id) DESC) as rh_os_rank FROM rh_os_cwe_split WHERE cwe_id IS NOT NULL AND cwe_id != '' AND cwe_id != 'NVD-CWE-Other' AND cwe_id != 'NVD-CWE-noinfo' GROUP BY cwe_id ORDER BY COUNT(DISTINCT cve_id) DESC """ redhat_opensource_all = analysis_con.sql(redhat_opensource_query).df() # Function to merge with ALL rankings and add CWE names def merge_with_complete_rankings(vendor_data, vendor_name): """Merge vendor data with complete rankings""" if vendor_data.empty: return vendor_data # Merge with all CVE rankings merged = vendor_data.merge( all_cve_ranking[['cwe_id', 'all_count', 'all_rank']], on='cwe_id', how='left' ) # Add CWE names if len(merged) > 0: cwe_list = "', '".join(merged['cwe_id'].astype(str).tolist()) cwe_names_query = f""" SELECT cwe_id, name as cwe_name FROM cwe_ref WHERE cwe_id IN ('{cwe_list}') """ try: cwe_names = analysis_con.sql(cwe_names_query).df() merged = merged.merge(cwe_names, on='cwe_id', how='left') except Exception as e: print(f"Warning: Could not fetch CWE names: {e}") merged['cwe_name'] = 'Unknown' merged['cwe_name'] = merged['cwe_name'].fillna('Unknown') return merged # Merge all data ms_complete = merge_with_complete_rankings(microsoft_all, 'Microsoft') cisco_complete = merge_with_complete_rankings(cisco_all, 'Cisco') rh_comm_complete = merge_with_complete_rankings(redhat_commercial_all, 'RedHat Commercial') rh_os_complete = merge_with_complete_rankings(redhat_opensource_all, 'RedHat Open-Source') gh_complete = merge_with_complete_rankings(github_all, 'GitHub') # Function to print top CWEs with proper formatting def print_vendor_comparison(data, title, vendor_col, count_col, rank_col, limit=15): """Print vendor comparison with proper '-' handling""" print(f"\n{'='*140}") print(f"{title}") print('='*140) print(f"{'CWE ID':<12} {'Name':<40} {f'{vendor_col} Rank':<12} {f'{vendor_col} Count':<15} {'Global Rank':<12} {'Global Count':<12}") print('='*140) for _, row in data.head(limit).iterrows(): cwe_name = str(row.get('cwe_name', 'Unknown')) if len(cwe_name) > 38: cwe_name = cwe_name[:37] + "..." vendor_rank = str(int(row[rank_col])) if pd.notna(row[rank_col]) else '-' vendor_count = f"{int(row[count_col]):,}" if pd.notna(row[count_col]) else '-' # : Proper handling of global rank if pd.notna(row['all_rank']): global_rank = str(int(row['all_rank'])) else: global_rank = f">{len(all_cve_ranking)}" # Show it's beyond top rankings global_count = f"{int(row['all_count']):,}" if pd.notna(row['all_count']) else '-' print(f"{row['cwe_id']:<12} {cwe_name:<40} {vendor_rank:<12} {vendor_count:<15} {global_rank:<12} {global_count:<12}") # Print all vendor comparisons print_vendor_comparison(ms_complete, "TABLE 1: Microsoft Patches vs All CVEs - ", "MS", "ms_count", "ms_rank") print_vendor_comparison(cisco_complete, "TABLE 2: Cisco Patches vs All CVEs - ", "Cisco", "cisco_count", "cisco_rank") print_vendor_comparison(rh_comm_complete, "TABLE 3: RedHat Commercial vs All CVEs - ", "RH-Comm", "rh_comm_count", "rh_comm_rank") print_vendor_comparison(rh_os_complete, "TABLE 4: RedHat Open-Source vs All CVEs - ", "RH-OS", "rh_os_count", "rh_os_rank") print_vendor_comparison(gh_complete, "TABLE 5: GitHub Open-Source vs All CVEs - ", "GH-OS", "gh_count", "gh_rank") # Summary Table with Correct Coverage Calculations print(f"\n{'='*160}") print("TABLE 6: Vendor CWE Coverage Summary") print('='*160) print(f"{'Vendor':<20} {'Total CVEs':<12} {'Actual CWEs':<12} {'Top CWE':<12} {'Top Count':<12} {'Coverage %':<15} {'Specialization':<20}") print('='*160) vendors_summary = [ ('Microsoft', microsoft_all, 'ms_count'), ('Cisco', cisco_all, 'cisco_count'), ('RedHat Commercial', redhat_commercial_all, 'rh_comm_count'), ('RedHat Open-Source', redhat_opensource_all, 'rh_os_count'), ('GitHub Open-Source', github_all, 'gh_count') ] for vendor_name, vendor_data, count_col in vendors_summary: if not vendor_data.empty: total_cves = vendor_data[count_col].sum() actual_unique_cwes = len(vendor_data) # ACTUAL number of unique CWEs top_cwe = vendor_data.iloc[0]['cwe_id'] top_count = vendor_data.iloc[0][count_col] # Coverage Calculation coverage_pct = (actual_unique_cwes / total_unique_cwes) * 100 # Calculate specialization (what % of their CVEs are in top CWE) specialization_pct = (top_count / total_cves) * 100 print(f"{vendor_name:<20} {total_cves:<12,} {actual_unique_cwes:<12} {top_cwe:<12} " f"{top_count:<12,} {coverage_pct:<15.1f}% {specialization_pct:<20.1f}%") # Explanation of the changes print(f"\n{'='*160}") print("EXPLANATION OF FIXES:") print('='*160) print(f"1. \\\"Total unique CWEs in database: {total_unique_cwes:,}\\\"") print("2. Coverage % = (Vendor's Unique CWEs / Total CWEs in DB) × 100") print("3. Specialization % = (Top CWE Count / Total CVEs) × 100") print("4. Global Rank > X means the CWE is not in the top X globally") print("5. Removed artificial LIMIT 20 to show actual coverage") # Analysis of vendor specializations print("\n" + "="*100) print("VENDOR SPECIALIZATION ANALYSIS:") print("="*100) print("\nMicrosoft Specializations (not in global top 25):") ms_special = ms_complete[pd.isna(ms_complete['all_rank']) | (ms_complete['all_rank'] > 25)].head(5) for _, row in ms_special.iterrows(): print(f" {row['cwe_id']}: {row['cwe_name']} ({int(row['ms_count'])} CVEs)") print("\nCisco Specializations (not in global top 25):") cisco_special = cisco_complete[pd.isna(cisco_complete['all_rank']) | (cisco_complete['all_rank'] > 25)].head(5) for _, row in cisco_special.iterrows(): print(f" {row['cwe_id']}: {row['cwe_name']} ({int(row['cisco_count'])} CVEs)") return { 'microsoft': ms_complete, 'cisco': cisco_complete, 'redhat_commercial': rh_comm_complete, 'redhat_opensource': rh_os_complete, 'github': gh_complete, 'all_rankings': all_cve_ranking, 'total_cwes': total_unique_cwes } # Execute Analysis print("Starting Vendor-Specific CWE Analysis...") fixed_results = create_vendor_cwe_analysis() print("\n=== ANALYSIS COMPLETE ===")
Starting FIXED Vendor-Specific CWE Analysis... === FIXED Vendor-Specific CWE Analysis === Total unique CWEs in database: 733
FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))
============================================================================================================================================
TABLE 1: Microsoft Patches vs All CVEs - FIXED
============================================================================================================================================
CWE ID Name MS Rank MS Count Global Rank Global Count
============================================================================================================================================
CWE-416 Use After Free 1 1,231 10 5,625
CWE-787 Out-of-bounds Write 2 868 5 9,849
CWE-125 Out-of-bounds Read 3 532 8 6,879
CWE-122 Heap-based Buffer Overflow 4 529 32 1,476
CWE-476 NULL Pointer Dereference 5 521 15 3,724
CWE-200 Exposure of Sensitive Information to ... 6 495 6 8,669
CWE-119 Improper Restriction of Operations wi... 7 491 3 12,037
CWE-20 Improper Input Validation 8 490 4 10,840
CWE-79 Improper Neutralization of Input Duri... 9 328 1 35,275
CWE-190 Integer Overflow or Wraparound 10 261 21 2,585
CWE-269 Improper Privilege Management 11 203 25 1,974
CWE-362 Concurrent Execution using Shared Res... 12 198 30 1,643
CWE-400 Uncontrolled Resource Consumption 13 188 23 2,279
CWE-59 Improper Link Resolution Before File ... 14 171 37 1,195
CWE-843 Access of Resource Using Incompatible... 15 159 52 589
============================================================================================================================================
TABLE 2: Cisco Patches vs All CVEs - FIXED
============================================================================================================================================
CWE ID Name Cisco Rank Cisco Count Global Rank Global Count
============================================================================================================================================
CWE-20 Improper Input Validation 1 964 4 10,840
CWE-79 Improper Neutralization of Input Duri... 2 636 1 35,275
CWE-399 Unknown 3 478 20 2,692
CWE-119 Improper Restriction of Operations wi... 4 342 3 12,037
CWE-264 Unknown 5 323 11 5,444
CWE-200 Exposure of Sensitive Information to ... 6 281 6 8,669
CWE-78 Improper Neutralization of Special El... 7 180 14 4,045
CWE-284 Improper Access Control 8 150 17 3,391
CWE-287 Improper Authentication 9 147 16 3,413
CWE-352 Cross-Site Request Forgery (CSRF) 10 143 7 7,524
CWE-121 Stack-based Buffer Overflow 11 128 26 1,973
CWE-22 Improper Limitation of a Pathname to ... 12 113 9 6,723
CWE-77 Improper Neutralization of Special El... 13 110 24 2,222
CWE-89 Improper Neutralization of Special El... 14 109 2 14,642
CWE-400 Uncontrolled Resource Consumption 15 105 23 2,279
============================================================================================================================================
TABLE 3: RedHat Commercial vs All CVEs - FIXED
============================================================================================================================================
CWE ID Name RH-Comm Rank RH-Comm Count Global Rank Global Count
============================================================================================================================================
CWE-502 Deserialization of Untrusted Data 1 126 29 1,760
CWE-20 Improper Input Validation 2 104 4 10,840
CWE-200 Exposure of Sensitive Information to ... 3 91 6 8,669
CWE-79 Improper Neutralization of Input Duri... 4 88 1 35,275
CWE-400 Uncontrolled Resource Consumption 5 83 23 2,279
CWE-264 Unknown 6 69 11 5,444
CWE-119 Improper Restriction of Operations wi... 7 55 3 12,037
CWE-770 Allocation of Resources Without Limit... 8 46 42 1,000
CWE-22 Improper Limitation of a Pathname to ... 9 44 9 6,723
CWE-287 Improper Authentication 10 43 16 3,413
CWE-190 Integer Overflow or Wraparound 11 38 21 2,585
CWE-310 Unknown 12 35 22 2,336
CWE-611 Improper Restriction of XML External ... 13 34 40 1,098
CWE-444 Inconsistent Interpretation of HTTP R... 14 33 98 238
CWE-918 Server-Side Request Forgery (SSRF) 15 32 31 1,515
============================================================================================================================================
TABLE 4: RedHat Open-Source vs All CVEs - FIXED
============================================================================================================================================
CWE ID Name RH-OS Rank RH-OS Count Global Rank Global Count
============================================================================================================================================
CWE-119 Improper Restriction of Operations wi... 1 1,616 3 12,037
CWE-416 Use After Free 2 1,122 10 5,625
CWE-787 Out-of-bounds Write 3 1,107 5 9,849
CWE-20 Improper Input Validation 4 1,073 4 10,840
CWE-125 Out-of-bounds Read 5 728 8 6,879
CWE-200 Exposure of Sensitive Information to ... 6 650 6 8,669
CWE-264 Unknown 7 615 11 5,444
CWE-476 NULL Pointer Dereference 8 599 15 3,724
CWE-399 Unknown 9 512 20 2,692
CWE-79 Improper Neutralization of Input Duri... 10 511 1 35,275
CWE-190 Integer Overflow or Wraparound 11 409 21 2,585
CWE-189 Unknown 12 398 35 1,225
CWE-400 Uncontrolled Resource Consumption 13 348 23 2,279
CWE-362 Concurrent Execution using Shared Res... 14 309 30 1,643
CWE-94 Improper Control of Generation of Cod... 15 213 13 4,455
============================================================================================================================================
TABLE 5: GitHub Open-Source vs All CVEs - FIXED
============================================================================================================================================
CWE ID Name GH-OS Rank GH-OS Count Global Rank Global Count
============================================================================================================================================
CWE-79 Improper Neutralization of Input Duri... 1 2,633 1 35,275
CWE-20 Improper Input Validation 2 824 4 10,840
CWE-200 Exposure of Sensitive Information to ... 3 760 6 8,669
CWE-22 Improper Limitation of a Pathname to ... 4 708 9 6,723
CWE-352 Cross-Site Request Forgery (CSRF) 5 514 7 7,524
CWE-400 Uncontrolled Resource Consumption 6 464 23 2,279
CWE-502 Deserialization of Untrusted Data 7 404 29 1,760
CWE-94 Improper Control of Generation of Cod... 8 390 13 4,455
CWE-89 Improper Neutralization of Special El... 9 378 2 14,642
CWE-287 Improper Authentication 10 324 16 3,413
CWE-787 Out-of-bounds Write 11 297 5 9,849
CWE-863 Incorrect Authorization 12 297 28 1,775
CWE-862 Missing Authorization 13 280 12 4,513
CWE-284 Improper Access Control 14 256 17 3,391
CWE-918 Server-Side Request Forgery (SSRF) 15 252 31 1,515
================================================================================================================================================================
TABLE 6: FIXED Vendor CWE Coverage Summary
================================================================================================================================================================
Vendor Total CVEs Actual CWEs Top CWE Top Count Coverage % Specialization
================================================================================================================================================================
Microsoft 10,225 314 CWE-416 1,231 42.8 % 12.0 %
Cisco 5,734 254 CWE-20 964 34.7 % 16.8 %
RedHat Commercial 1,760 203 CWE-502 126 27.7 % 7.2 %
RedHat Open-Source 15,917 359 CWE-119 1,616 49.0 % 10.2 %
GitHub Open-Source 16,634 444 CWE-79 2,633 60.6 % 15.8 %
================================================================================================================================================================
EXPLANATION OF FIXES:
================================================================================================================================================================
1. \"Total unique CWEs in database: 733\"
2. Coverage % = (Vendor's Unique CWEs / Total CWEs in DB) × 100
3. Specialization % = (Top CWE Count / Total CVEs) × 100
4. Global Rank > X means the CWE is not in the top X globally
5. Removed artificial LIMIT 20 to show actual coverage
====================================================================================================
VENDOR SPECIALIZATION ANALYSIS:
====================================================================================================
Microsoft Specializations (not in global top 25):
CWE-122: Heap-based Buffer Overflow (529 CVEs)
CWE-362: Concurrent Execution using Shared Resource with Improper Synchronization ('Race Condition') (198 CVEs)
CWE-59: Improper Link Resolution Before File Access ('Link Following') (171 CVEs)
CWE-843: Access of Resource Using Incompatible Type ('Type Confusion') (159 CVEs)
CWE-401: Missing Release of Memory after Effective Lifetime (120 CVEs)
Cisco Specializations (not in global top 25):
CWE-121: Stack-based Buffer Overflow (128 CVEs)
CWE-255: Unknown (53 CVEs)
CWE-285: Improper Authorization (43 CVEs)
CWE-347: Improper Verification of Cryptographic Signature (38 CVEs)
CWE-693: Protection Mechanism Failure (35 CVEs)
=== FIXED ANALYSIS COMPLETE ===
In [ ]:
Copied!
# Set style for better visualizations
plt.style.use('default')
sns.set_palette("husl")
def create_statistical_concepts_visualization():
"""
Create comprehensive visualizations explaining R-value, Gini coefficient, and Standard Deviation
"""
print("Creating Statistical Concepts Visualization...")
# Create a large figure with multiple subplots
fig = plt.figure(figsize=(20, 16))
# =================================================================
# PART 1: CORRELATION COEFFICIENT (R-VALUE) EXPLANATION
# =================================================================
print("1. Explaining Correlation Coefficient (R-value)...")
# Create sample data for different correlation scenarios
np.random.seed(42)
n_points = 50
# Perfect positive correlation (r ≈ 1.0)
x1 = np.linspace(0, 10, n_points)
y1 = 2 * x1 + np.random.normal(0, 0.5, n_points)
r1 = np.corrcoef(x1, y1)[0, 1]
# Perfect negative correlation (r ≈ -1.0)
x2 = np.linspace(0, 10, n_points)
y2 = -1.5 * x2 + 15 + np.random.normal(0, 0.5, n_points)
r2 = np.corrcoef(x2, y2)[0, 1]
# No correlation (r ≈ 0)
x3 = np.random.uniform(0, 10, n_points)
y3 = np.random.uniform(0, 10, n_points)
r3 = np.corrcoef(x3, y3)[0, 1]
# Moderate positive correlation (r ≈ 0.6)
x4 = np.linspace(0, 10, n_points)
y4 = 1.2 * x4 + np.random.normal(0, 2, n_points)
r4 = np.corrcoef(x4, y4)[0, 1]
# Subplot 1: Strong Positive Correlation
ax1 = plt.subplot(4, 3, 1)
plt.scatter(x1, y1, alpha=0.7, color='blue', s=60)
z1 = np.polyfit(x1, y1, 1)
p1 = np.poly1d(z1)
plt.plot(x1, p1(x1), "r--", alpha=0.8, linewidth=2)
plt.title(f'Strong Positive Correlation\nr = {r1:.3f}', fontsize=12, fontweight='bold')
plt.xlabel('X Variable', fontsize=10)
plt.ylabel('Y Variable', fontsize=10)
plt.grid(alpha=0.3)
# Subplot 2: Strong Negative Correlation
ax2 = plt.subplot(4, 3, 2)
plt.scatter(x2, y2, alpha=0.7, color='red', s=60)
z2 = np.polyfit(x2, y2, 1)
p2 = np.poly1d(z2)
plt.plot(x2, p2(x2), "r--", alpha=0.8, linewidth=2)
plt.title(f'Strong Negative Correlation\nr = {r2:.3f}', fontsize=12, fontweight='bold')
plt.xlabel('X Variable', fontsize=10)
plt.ylabel('Y Variable', fontsize=10)
plt.grid(alpha=0.3)
# Subplot 3: No Correlation
ax3 = plt.subplot(4, 3, 3)
plt.scatter(x3, y3, alpha=0.7, color='green', s=60)
z3 = np.polyfit(x3, y3, 1)
p3 = np.poly1d(z3)
plt.plot(x3, p3(x3), "r--", alpha=0.8, linewidth=2)
plt.title(f'No Correlation\nr = {r3:.3f}', fontsize=12, fontweight='bold')
plt.xlabel('X Variable', fontsize=10)
plt.ylabel('Y Variable', fontsize=10)
plt.grid(alpha=0.3)
# =================================================================
# PART 2: GINI COEFFICIENT EXPLANATION
# =================================================================
print("2. Explaining Gini Coefficient...")
# Create sample data for different inequality scenarios
# Perfect equality (Gini = 0)
equal_data = [20, 20, 20, 20, 20] # Everyone has equal amount
# High inequality (Gini ≈ 0.8)
unequal_data = [1, 2, 3, 4, 90] # One person has most
# Moderate inequality (Gini ≈ 0.4)
moderate_data = [10, 15, 20, 25, 30] # Some inequality
def calculate_gini(data):
"""Calculate Gini coefficient"""
data = np.array(data, dtype=float)
data = np.sort(data)
n = len(data)
cumsum = np.cumsum(data)
return (n + 1 - 2 * np.sum(cumsum) / cumsum[-1]) / n
def plot_lorenz_curve(data, ax, title, color):
"""Plot Lorenz curve for Gini visualization"""
data = np.array(data, dtype=float)
data = np.sort(data)
n = len(data)
# Calculate cumulative percentages
cum_people = np.arange(1, n + 1) / n
cum_wealth = np.cumsum(data) / np.sum(data)
# Add origin point
cum_people = np.insert(cum_people, 0, 0)
cum_wealth = np.insert(cum_wealth, 0, 0)
# Plot Lorenz curve
ax.plot(cum_people, cum_wealth, color=color, linewidth=3, label='Lorenz Curve')
ax.fill_between(cum_people, cum_wealth, cum_people, alpha=0.3, color=color, label='Inequality Area')
# Plot line of equality
ax.plot([0, 1], [0, 1], 'k--', linewidth=2, alpha=0.7, label='Perfect Equality')
# Calculate and display Gini
gini = calculate_gini(data)
ax.set_title(f'{title}\nGini = {gini:.3f}', fontsize=12, fontweight='bold')
ax.set_xlabel('Cumulative % of People', fontsize=10)
ax.set_ylabel('Cumulative % of Resource', fontsize=10)
ax.grid(alpha=0.3)
ax.legend(fontsize=8)
return gini
# Subplot 4: Perfect Equality
ax4 = plt.subplot(4, 3, 4)
gini1 = plot_lorenz_curve(equal_data, ax4, 'Perfect Equality', 'green')
# Subplot 5: Moderate Inequality
ax5 = plt.subplot(4, 3, 5)
gini2 = plot_lorenz_curve(moderate_data, ax5, 'Moderate Inequality', 'orange')
# Subplot 6: High Inequality
ax6 = plt.subplot(4, 3, 6)
gini3 = plot_lorenz_curve(unequal_data, ax6, 'High Inequality', 'red')
# =================================================================
# PART 3: STANDARD DEVIATION EXPLANATION
# =================================================================
print("3. Explaining Standard Deviation (SD)...")
# Create datasets with different standard deviations
np.random.seed(42)
# Low SD (data clustered around mean)
low_sd_data = np.random.normal(50, 5, 1000) # mean=50, sd=5
# Medium SD
med_sd_data = np.random.normal(50, 15, 1000) # mean=50, sd=15
# High SD (data spread out)
high_sd_data = np.random.normal(50, 25, 1000) # mean=50, sd=25
# Subplot 7: Low Standard Deviation
ax7 = plt.subplot(4, 3, 7)
plt.hist(low_sd_data, bins=30, alpha=0.7, color='blue', density=True)
mean_low = np.mean(low_sd_data)
sd_low = np.std(low_sd_data)
plt.axvline(mean_low, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_low:.1f}')
plt.axvline(mean_low - sd_low, color='orange', linestyle=':', linewidth=2, alpha=0.7)
plt.axvline(mean_low + sd_low, color='orange', linestyle=':', linewidth=2, alpha=0.7, label=f'±1 SD = ±{sd_low:.1f}')
plt.title(f'Low Standard Deviation\nSD = {sd_low:.1f}', fontsize=12, fontweight='bold')
plt.xlabel('Value', fontsize=10)
plt.ylabel('Frequency', fontsize=10)
plt.legend(fontsize=8)
plt.grid(alpha=0.3)
# Subplot 8: Medium Standard Deviation
ax8 = plt.subplot(4, 3, 8)
plt.hist(med_sd_data, bins=30, alpha=0.7, color='orange', density=True)
mean_med = np.mean(med_sd_data)
sd_med = np.std(med_sd_data)
plt.axvline(mean_med, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_med:.1f}')
plt.axvline(mean_med - sd_med, color='orange', linestyle=':', linewidth=2, alpha=0.7)
plt.axvline(mean_med + sd_med, color='orange', linestyle=':', linewidth=2, alpha=0.7, label=f'±1 SD = ±{sd_med:.1f}')
plt.title(f'Medium Standard Deviation\nSD = {sd_med:.1f}', fontsize=12, fontweight='bold')
plt.xlabel('Value', fontsize=10)
plt.ylabel('Frequency', fontsize=10)
plt.legend(fontsize=8)
plt.grid(alpha=0.3)
# Subplot 9: High Standard Deviation
ax9 = plt.subplot(4, 3, 9)
plt.hist(high_sd_data, bins=30, alpha=0.7, color='red', density=True)
mean_high = np.mean(high_sd_data)
sd_high = np.std(high_sd_data)
plt.axvline(mean_high, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_high:.1f}')
plt.axvline(mean_high - sd_high, color='orange', linestyle=':', linewidth=2, alpha=0.7)
plt.axvline(mean_high + sd_high, color='orange', linestyle=':', linewidth=2, alpha=0.7, label=f'±1 SD = ±{sd_high:.1f}')
plt.title(f'High Standard Deviation\nSD = {sd_high:.1f}', fontsize=12, fontweight='bold')
plt.xlabel('Value', fontsize=10)
plt.ylabel('Frequency', fontsize=10)
plt.legend(fontsize=8)
plt.grid(alpha=0.3)
# =================================================================
# PART 4: PRACTICAL EXAMPLES FROM OUR RESEARCH
# =================================================================
print("4. Creating Practical Examples from OUR Research...")
# Example from our vendor analysis
vendors = ['Microsoft', 'Cisco', 'RedHat Com.', 'RedHat OS', 'GitHub OS']
coverage_pct = [42.8, 34.7, 27.7, 49.0, 60.6]
specialization_pct = [11.8, 16.8, 7.1, 10.0, 15.7]
# Subplot 10: our Research Example - Correlation
ax10 = plt.subplot(4, 3, 10)
plt.scatter(coverage_pct, specialization_pct, s=100, alpha=0.8, color=['blue', 'orange', 'green', 'red', 'purple'])
# Add vendor labels
for i, vendor in enumerate(vendors):
plt.annotate(vendor, (coverage_pct[i], specialization_pct[i]),
xytext=(5, 5), textcoords='offset points', fontsize=9)
# Calculate correlation
r_research = np.corrcoef(coverage_pct, specialization_pct)[0, 1]
# Add trend line
z = np.polyfit(coverage_pct, specialization_pct, 1)
p = np.poly1d(z)
plt.plot(coverage_pct, p(coverage_pct), "r--", alpha=0.8, linewidth=2)
plt.title(f'our Research: Coverage vs Specialization\nr = {r_research:.3f}', fontsize=12, fontweight='bold')
plt.xlabel('CWE Coverage %', fontsize=10)
plt.ylabel('Specialization %', fontsize=10)
plt.grid(alpha=0.3)
# Subplot 11: our Research Example - Standard Deviation
ax11 = plt.subplot(4, 3, 11)
mean_coverage = np.mean(coverage_pct)
sd_coverage = np.std(coverage_pct, ddof=1) # Sample standard deviation
bars = plt.bar(vendors, coverage_pct, alpha=0.7, color=['blue', 'orange', 'green', 'red', 'purple'])
plt.axhline(mean_coverage, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_coverage:.1f}%')
plt.axhline(mean_coverage + sd_coverage, color='orange', linestyle=':', alpha=0.7, label=f'+1 SD = {mean_coverage + sd_coverage:.1f}%')
plt.axhline(mean_coverage - sd_coverage, color='orange', linestyle=':', alpha=0.7, label=f'-1 SD = {mean_coverage - sd_coverage:.1f}%')
plt.title(f'our Research: CWE Coverage\nMean = {mean_coverage:.1f}%, SD = {sd_coverage:.1f}%', fontsize=12, fontweight='bold')
plt.ylabel('CWE Coverage %', fontsize=10)
plt.xticks(rotation=45, ha='right')
plt.legend(fontsize=8)
plt.grid(alpha=0.3)
# Subplot 12: Summary Interpretation
ax12 = plt.subplot(4, 3, 12)
ax12.axis('off')
# Add interpretation text
interpretation_text = f"""
INTERPRETATION OF OUR RESULTS:
R-VALUE (Correlation):
• r = {r_research:.3f} shows MODERATE POSITIVE correlation
• Vendors with higher CWE coverage tend to have
higher specialization (but not perfectly)
STANDARD DEVIATION:
• Coverage SD = {sd_coverage:.1f}% shows MODERATE spread
• Vendors vary significantly in CWE diversity
• GitHub (60.6%) much higher than RedHat Com. (27.7%)
GINI COEFFICIENT USAGE:
• In our paper: Measures CWE distribution inequality
• Higher Gini = More concentrated vulnerabilities
• Lower Gini = More evenly distributed vulnerabilities
"""
ax12.text(0.05, 0.95, interpretation_text, transform=ax12.transAxes, fontsize=10,
verticalalignment='top', fontfamily='monospace',
bbox=dict(boxstyle="round,pad=0.5", facecolor="lightblue", alpha=0.8))
# Add overall title
fig.suptitle('Statistical Concepts Explained: R-value, Gini Coefficient, and Standard Deviation',
fontsize=16, fontweight='bold', y=0.98)
plt.tight_layout()
plt.subplots_adjust(top=0.90)
plt.savefig('statistical_concepts_explanation.png', dpi=300, bbox_inches='tight')
plt.show()
# =================================================================
# DETAILED EXPLANATIONS
# =================================================================
print("\n" + "="*80)
print("DETAILED EXPLANATIONS")
print("="*80)
print("\n. CORRELATION COEFFICIENT (R-VALUE):")
print(" • Range: -1.0 to +1.0")
print(" • +1.0 = Perfect positive relationship (as X increases, Y increases)")
print(" • -1.0 = Perfect negative relationship (as X increases, Y decreases)")
print(" • 0.0 = No linear relationship")
print(" • |r| > 0.7 = Strong correlation")
print(" • 0.3 < |r| < 0.7 = Moderate correlation")
print(" • |r| < 0.3 = Weak correlation")
print(f" • OUR RESULT: r = {r_research:.3f} = MODERATE positive correlation")
print("\n2. GINI COEFFICIENT:")
print(" • Range: 0.0 to 1.0")
print(" • 0.0 = Perfect equality (everyone has same amount)")
print(" • 1.0 = Perfect inequality (one person has everything)")
print(" • Used in economics for income inequality")
print(" • In our research: Measures how evenly CWEs are distributed")
print(" • Higher Gini = Few CWEs dominate (specialized vendor)")
print(" • Lower Gini = CWEs evenly spread (diversified vendor)")
print("\n3. STANDARD DEVIATION (SD):")
print(" • Measures how spread out data points are from the mean")
print(" • Same units as original data")
print(" • Low SD = Data clustered around mean")
print(" • High SD = Data spread out widely")
print(" • ~68% of data within ±1 SD of mean (normal distribution)")
print(f" • OUR RESULT: Coverage SD = {sd_coverage:.1f}% means vendors vary significantly")
print("\n4. PRACTICAL INTERPRETATION FOR OUR RESEARCH:")
print(" • r = 0.444 means vendors with more CWE types tend to be more specialized")
print(" • This seems counterintuitive but makes sense:")
print(" - Diverse vendors encounter many weakness types")
print(" - But still have dominant vulnerability patterns")
print(" • SD = 12.7% shows substantial vendor differences")
print(" • GitHub (60.6%) vs RedHat Commercial (27.7%) = 33% difference!")
return {
'correlation_example': r_research,
'coverage_mean': mean_coverage,
'coverage_sd': sd_coverage,
'vendor_data': {
'vendors': vendors,
'coverage': coverage_pct,
'specialization': specialization_pct
}
}
# Execute the visualization
results = create_statistical_concepts_visualization()
print("\n=== STATISTICAL CONCEPTS VISUALIZATION COMPLETE ===")
print("Generated comprehensive explanation with examples from our research!")
# Set style for better visualizations plt.style.use('default') sns.set_palette("husl") def create_statistical_concepts_visualization(): """ Create comprehensive visualizations explaining R-value, Gini coefficient, and Standard Deviation """ print("Creating Statistical Concepts Visualization...") # Create a large figure with multiple subplots fig = plt.figure(figsize=(20, 16)) # ================================================================= # PART 1: CORRELATION COEFFICIENT (R-VALUE) EXPLANATION # ================================================================= print("1. Explaining Correlation Coefficient (R-value)...") # Create sample data for different correlation scenarios np.random.seed(42) n_points = 50 # Perfect positive correlation (r ≈ 1.0) x1 = np.linspace(0, 10, n_points) y1 = 2 * x1 + np.random.normal(0, 0.5, n_points) r1 = np.corrcoef(x1, y1)[0, 1] # Perfect negative correlation (r ≈ -1.0) x2 = np.linspace(0, 10, n_points) y2 = -1.5 * x2 + 15 + np.random.normal(0, 0.5, n_points) r2 = np.corrcoef(x2, y2)[0, 1] # No correlation (r ≈ 0) x3 = np.random.uniform(0, 10, n_points) y3 = np.random.uniform(0, 10, n_points) r3 = np.corrcoef(x3, y3)[0, 1] # Moderate positive correlation (r ≈ 0.6) x4 = np.linspace(0, 10, n_points) y4 = 1.2 * x4 + np.random.normal(0, 2, n_points) r4 = np.corrcoef(x4, y4)[0, 1] # Subplot 1: Strong Positive Correlation ax1 = plt.subplot(4, 3, 1) plt.scatter(x1, y1, alpha=0.7, color='blue', s=60) z1 = np.polyfit(x1, y1, 1) p1 = np.poly1d(z1) plt.plot(x1, p1(x1), "r--", alpha=0.8, linewidth=2) plt.title(f'Strong Positive Correlation\nr = {r1:.3f}', fontsize=12, fontweight='bold') plt.xlabel('X Variable', fontsize=10) plt.ylabel('Y Variable', fontsize=10) plt.grid(alpha=0.3) # Subplot 2: Strong Negative Correlation ax2 = plt.subplot(4, 3, 2) plt.scatter(x2, y2, alpha=0.7, color='red', s=60) z2 = np.polyfit(x2, y2, 1) p2 = np.poly1d(z2) plt.plot(x2, p2(x2), "r--", alpha=0.8, linewidth=2) plt.title(f'Strong Negative Correlation\nr = {r2:.3f}', fontsize=12, fontweight='bold') plt.xlabel('X Variable', fontsize=10) plt.ylabel('Y Variable', fontsize=10) plt.grid(alpha=0.3) # Subplot 3: No Correlation ax3 = plt.subplot(4, 3, 3) plt.scatter(x3, y3, alpha=0.7, color='green', s=60) z3 = np.polyfit(x3, y3, 1) p3 = np.poly1d(z3) plt.plot(x3, p3(x3), "r--", alpha=0.8, linewidth=2) plt.title(f'No Correlation\nr = {r3:.3f}', fontsize=12, fontweight='bold') plt.xlabel('X Variable', fontsize=10) plt.ylabel('Y Variable', fontsize=10) plt.grid(alpha=0.3) # ================================================================= # PART 2: GINI COEFFICIENT EXPLANATION # ================================================================= print("2. Explaining Gini Coefficient...") # Create sample data for different inequality scenarios # Perfect equality (Gini = 0) equal_data = [20, 20, 20, 20, 20] # Everyone has equal amount # High inequality (Gini ≈ 0.8) unequal_data = [1, 2, 3, 4, 90] # One person has most # Moderate inequality (Gini ≈ 0.4) moderate_data = [10, 15, 20, 25, 30] # Some inequality def calculate_gini(data): """Calculate Gini coefficient""" data = np.array(data, dtype=float) data = np.sort(data) n = len(data) cumsum = np.cumsum(data) return (n + 1 - 2 * np.sum(cumsum) / cumsum[-1]) / n def plot_lorenz_curve(data, ax, title, color): """Plot Lorenz curve for Gini visualization""" data = np.array(data, dtype=float) data = np.sort(data) n = len(data) # Calculate cumulative percentages cum_people = np.arange(1, n + 1) / n cum_wealth = np.cumsum(data) / np.sum(data) # Add origin point cum_people = np.insert(cum_people, 0, 0) cum_wealth = np.insert(cum_wealth, 0, 0) # Plot Lorenz curve ax.plot(cum_people, cum_wealth, color=color, linewidth=3, label='Lorenz Curve') ax.fill_between(cum_people, cum_wealth, cum_people, alpha=0.3, color=color, label='Inequality Area') # Plot line of equality ax.plot([0, 1], [0, 1], 'k--', linewidth=2, alpha=0.7, label='Perfect Equality') # Calculate and display Gini gini = calculate_gini(data) ax.set_title(f'{title}\nGini = {gini:.3f}', fontsize=12, fontweight='bold') ax.set_xlabel('Cumulative % of People', fontsize=10) ax.set_ylabel('Cumulative % of Resource', fontsize=10) ax.grid(alpha=0.3) ax.legend(fontsize=8) return gini # Subplot 4: Perfect Equality ax4 = plt.subplot(4, 3, 4) gini1 = plot_lorenz_curve(equal_data, ax4, 'Perfect Equality', 'green') # Subplot 5: Moderate Inequality ax5 = plt.subplot(4, 3, 5) gini2 = plot_lorenz_curve(moderate_data, ax5, 'Moderate Inequality', 'orange') # Subplot 6: High Inequality ax6 = plt.subplot(4, 3, 6) gini3 = plot_lorenz_curve(unequal_data, ax6, 'High Inequality', 'red') # ================================================================= # PART 3: STANDARD DEVIATION EXPLANATION # ================================================================= print("3. Explaining Standard Deviation (SD)...") # Create datasets with different standard deviations np.random.seed(42) # Low SD (data clustered around mean) low_sd_data = np.random.normal(50, 5, 1000) # mean=50, sd=5 # Medium SD med_sd_data = np.random.normal(50, 15, 1000) # mean=50, sd=15 # High SD (data spread out) high_sd_data = np.random.normal(50, 25, 1000) # mean=50, sd=25 # Subplot 7: Low Standard Deviation ax7 = plt.subplot(4, 3, 7) plt.hist(low_sd_data, bins=30, alpha=0.7, color='blue', density=True) mean_low = np.mean(low_sd_data) sd_low = np.std(low_sd_data) plt.axvline(mean_low, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_low:.1f}') plt.axvline(mean_low - sd_low, color='orange', linestyle=':', linewidth=2, alpha=0.7) plt.axvline(mean_low + sd_low, color='orange', linestyle=':', linewidth=2, alpha=0.7, label=f'±1 SD = ±{sd_low:.1f}') plt.title(f'Low Standard Deviation\nSD = {sd_low:.1f}', fontsize=12, fontweight='bold') plt.xlabel('Value', fontsize=10) plt.ylabel('Frequency', fontsize=10) plt.legend(fontsize=8) plt.grid(alpha=0.3) # Subplot 8: Medium Standard Deviation ax8 = plt.subplot(4, 3, 8) plt.hist(med_sd_data, bins=30, alpha=0.7, color='orange', density=True) mean_med = np.mean(med_sd_data) sd_med = np.std(med_sd_data) plt.axvline(mean_med, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_med:.1f}') plt.axvline(mean_med - sd_med, color='orange', linestyle=':', linewidth=2, alpha=0.7) plt.axvline(mean_med + sd_med, color='orange', linestyle=':', linewidth=2, alpha=0.7, label=f'±1 SD = ±{sd_med:.1f}') plt.title(f'Medium Standard Deviation\nSD = {sd_med:.1f}', fontsize=12, fontweight='bold') plt.xlabel('Value', fontsize=10) plt.ylabel('Frequency', fontsize=10) plt.legend(fontsize=8) plt.grid(alpha=0.3) # Subplot 9: High Standard Deviation ax9 = plt.subplot(4, 3, 9) plt.hist(high_sd_data, bins=30, alpha=0.7, color='red', density=True) mean_high = np.mean(high_sd_data) sd_high = np.std(high_sd_data) plt.axvline(mean_high, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_high:.1f}') plt.axvline(mean_high - sd_high, color='orange', linestyle=':', linewidth=2, alpha=0.7) plt.axvline(mean_high + sd_high, color='orange', linestyle=':', linewidth=2, alpha=0.7, label=f'±1 SD = ±{sd_high:.1f}') plt.title(f'High Standard Deviation\nSD = {sd_high:.1f}', fontsize=12, fontweight='bold') plt.xlabel('Value', fontsize=10) plt.ylabel('Frequency', fontsize=10) plt.legend(fontsize=8) plt.grid(alpha=0.3) # ================================================================= # PART 4: PRACTICAL EXAMPLES FROM OUR RESEARCH # ================================================================= print("4. Creating Practical Examples from OUR Research...") # Example from our vendor analysis vendors = ['Microsoft', 'Cisco', 'RedHat Com.', 'RedHat OS', 'GitHub OS'] coverage_pct = [42.8, 34.7, 27.7, 49.0, 60.6] specialization_pct = [11.8, 16.8, 7.1, 10.0, 15.7] # Subplot 10: our Research Example - Correlation ax10 = plt.subplot(4, 3, 10) plt.scatter(coverage_pct, specialization_pct, s=100, alpha=0.8, color=['blue', 'orange', 'green', 'red', 'purple']) # Add vendor labels for i, vendor in enumerate(vendors): plt.annotate(vendor, (coverage_pct[i], specialization_pct[i]), xytext=(5, 5), textcoords='offset points', fontsize=9) # Calculate correlation r_research = np.corrcoef(coverage_pct, specialization_pct)[0, 1] # Add trend line z = np.polyfit(coverage_pct, specialization_pct, 1) p = np.poly1d(z) plt.plot(coverage_pct, p(coverage_pct), "r--", alpha=0.8, linewidth=2) plt.title(f'our Research: Coverage vs Specialization\nr = {r_research:.3f}', fontsize=12, fontweight='bold') plt.xlabel('CWE Coverage %', fontsize=10) plt.ylabel('Specialization %', fontsize=10) plt.grid(alpha=0.3) # Subplot 11: our Research Example - Standard Deviation ax11 = plt.subplot(4, 3, 11) mean_coverage = np.mean(coverage_pct) sd_coverage = np.std(coverage_pct, ddof=1) # Sample standard deviation bars = plt.bar(vendors, coverage_pct, alpha=0.7, color=['blue', 'orange', 'green', 'red', 'purple']) plt.axhline(mean_coverage, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_coverage:.1f}%') plt.axhline(mean_coverage + sd_coverage, color='orange', linestyle=':', alpha=0.7, label=f'+1 SD = {mean_coverage + sd_coverage:.1f}%') plt.axhline(mean_coverage - sd_coverage, color='orange', linestyle=':', alpha=0.7, label=f'-1 SD = {mean_coverage - sd_coverage:.1f}%') plt.title(f'our Research: CWE Coverage\nMean = {mean_coverage:.1f}%, SD = {sd_coverage:.1f}%', fontsize=12, fontweight='bold') plt.ylabel('CWE Coverage %', fontsize=10) plt.xticks(rotation=45, ha='right') plt.legend(fontsize=8) plt.grid(alpha=0.3) # Subplot 12: Summary Interpretation ax12 = plt.subplot(4, 3, 12) ax12.axis('off') # Add interpretation text interpretation_text = f""" INTERPRETATION OF OUR RESULTS: R-VALUE (Correlation): • r = {r_research:.3f} shows MODERATE POSITIVE correlation • Vendors with higher CWE coverage tend to have higher specialization (but not perfectly) STANDARD DEVIATION: • Coverage SD = {sd_coverage:.1f}% shows MODERATE spread • Vendors vary significantly in CWE diversity • GitHub (60.6%) much higher than RedHat Com. (27.7%) GINI COEFFICIENT USAGE: • In our paper: Measures CWE distribution inequality • Higher Gini = More concentrated vulnerabilities • Lower Gini = More evenly distributed vulnerabilities """ ax12.text(0.05, 0.95, interpretation_text, transform=ax12.transAxes, fontsize=10, verticalalignment='top', fontfamily='monospace', bbox=dict(boxstyle="round,pad=0.5", facecolor="lightblue", alpha=0.8)) # Add overall title fig.suptitle('Statistical Concepts Explained: R-value, Gini Coefficient, and Standard Deviation', fontsize=16, fontweight='bold', y=0.98) plt.tight_layout() plt.subplots_adjust(top=0.90) plt.savefig('statistical_concepts_explanation.png', dpi=300, bbox_inches='tight') plt.show() # ================================================================= # DETAILED EXPLANATIONS # ================================================================= print("\n" + "="*80) print("DETAILED EXPLANATIONS") print("="*80) print("\n. CORRELATION COEFFICIENT (R-VALUE):") print(" • Range: -1.0 to +1.0") print(" • +1.0 = Perfect positive relationship (as X increases, Y increases)") print(" • -1.0 = Perfect negative relationship (as X increases, Y decreases)") print(" • 0.0 = No linear relationship") print(" • |r| > 0.7 = Strong correlation") print(" • 0.3 < |r| < 0.7 = Moderate correlation") print(" • |r| < 0.3 = Weak correlation") print(f" • OUR RESULT: r = {r_research:.3f} = MODERATE positive correlation") print("\n2. GINI COEFFICIENT:") print(" • Range: 0.0 to 1.0") print(" • 0.0 = Perfect equality (everyone has same amount)") print(" • 1.0 = Perfect inequality (one person has everything)") print(" • Used in economics for income inequality") print(" • In our research: Measures how evenly CWEs are distributed") print(" • Higher Gini = Few CWEs dominate (specialized vendor)") print(" • Lower Gini = CWEs evenly spread (diversified vendor)") print("\n3. STANDARD DEVIATION (SD):") print(" • Measures how spread out data points are from the mean") print(" • Same units as original data") print(" • Low SD = Data clustered around mean") print(" • High SD = Data spread out widely") print(" • ~68% of data within ±1 SD of mean (normal distribution)") print(f" • OUR RESULT: Coverage SD = {sd_coverage:.1f}% means vendors vary significantly") print("\n4. PRACTICAL INTERPRETATION FOR OUR RESEARCH:") print(" • r = 0.444 means vendors with more CWE types tend to be more specialized") print(" • This seems counterintuitive but makes sense:") print(" - Diverse vendors encounter many weakness types") print(" - But still have dominant vulnerability patterns") print(" • SD = 12.7% shows substantial vendor differences") print(" • GitHub (60.6%) vs RedHat Commercial (27.7%) = 33% difference!") return { 'correlation_example': r_research, 'coverage_mean': mean_coverage, 'coverage_sd': sd_coverage, 'vendor_data': { 'vendors': vendors, 'coverage': coverage_pct, 'specialization': specialization_pct } } # Execute the visualization results = create_statistical_concepts_visualization() print("\n=== STATISTICAL CONCEPTS VISUALIZATION COMPLETE ===") print("Generated comprehensive explanation with examples from our research!")
Creating Statistical Concepts Visualization... 1. Explaining Correlation Coefficient (R-value)... 2. Explaining Gini Coefficient... 3. Explaining Standard Deviation (SD)... 4. Creating Practical Examples from OUR Research...
================================================================================
DETAILED EXPLANATIONS
================================================================================
. CORRELATION COEFFICIENT (R-VALUE):
• Range: -1.0 to +1.0
• +1.0 = Perfect positive relationship (as X increases, Y increases)
• -1.0 = Perfect negative relationship (as X increases, Y decreases)
• 0.0 = No linear relationship
• |r| > 0.7 = Strong correlation
• 0.3 < |r| < 0.7 = Moderate correlation
• |r| < 0.3 = Weak correlation
• OUR RESULT: r = 0.432 = MODERATE positive correlation
2. GINI COEFFICIENT:
• Range: 0.0 to 1.0
• 0.0 = Perfect equality (everyone has same amount)
• 1.0 = Perfect inequality (one person has everything)
• Used in economics for income inequality
• In our research: Measures how evenly CWEs are distributed
• Higher Gini = Few CWEs dominate (specialized vendor)
• Lower Gini = CWEs evenly spread (diversified vendor)
3. STANDARD DEVIATION (SD):
• Measures how spread out data points are from the mean
• Same units as original data
• Low SD = Data clustered around mean
• High SD = Data spread out widely
• ~68% of data within ±1 SD of mean (normal distribution)
• OUR RESULT: Coverage SD = 12.7% means vendors vary significantly
4. PRACTICAL INTERPRETATION FOR OUR RESEARCH:
• r = 0.444 means vendors with more CWE types tend to be more specialized
• This seems counterintuitive but makes sense:
- Diverse vendors encounter many weakness types
- But still have dominant vulnerability patterns
• SD = 12.7% shows substantial vendor differences
• GitHub (60.6%) vs RedHat Commercial (27.7%) = 33% difference!
=== STATISTICAL CONCEPTS VISUALIZATION COMPLETE ===
Generated comprehensive explanation with examples from our research!
In [ ]:
Copied!
# Set academic style
plt.style.use('default')
sns.set_palette("husl")
def create_academic_cwe_visualizations(vendor_results):
"""
Create publication-quality academic visualizations for CWE analysis
"""
print("Creating Academic-Quality CWE Visualizations...")
# Prepare data from vendor results
vendors_data = {
'Microsoft': {'data': vendor_results['microsoft'], 'count_col': 'ms_count', 'color': '#1f77b4'},
'Cisco': {'data': vendor_results['cisco'], 'count_col': 'cisco_count', 'color': '#ff7f0e'},
'RedHat Commercial': {'data': vendor_results['redhat_commercial'], 'count_col': 'rh_comm_count', 'color': '#2ca02c'},
'RedHat Open-Source': {'data': vendor_results['redhat_opensource'], 'count_col': 'rh_os_count', 'color': '#d62728'},
'GitHub Open-Source': {'data': vendor_results['github'], 'count_col': 'gh_count', 'color': '#9467bd'}
}
# Create summary statistics
summary_stats = []
for vendor, info in vendors_data.items():
if not info['data'].empty:
total_cves = info['data'][info['count_col']].sum()
unique_cwes = len(info['data'])
top_cwe = info['data'].iloc[0]['cwe_id']
top_count = info['data'].iloc[0][info['count_col']]
specialization = (top_count / total_cves) * 100
coverage = (unique_cwes / vendor_results['total_cwes']) * 100
summary_stats.append({
'Vendor': vendor,
'Total_CVEs': total_cves,
'Unique_CWEs': unique_cwes,
'Coverage_Percent': coverage,
'Specialization_Percent': specialization,
'Top_CWE': top_cwe,
'Top_Count': top_count,
'Color': info['color']
})
summary_df = pd.DataFrame(summary_stats)
# =================================================================
# FIGURE 1: Comprehensive Vendor Comparison (2x2 Grid)
# =================================================================
fig1, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
fig1.suptitle('Vendor-Specific Common Weakness Enumeration (CWE) Analysis',
fontsize=16, fontweight='bold', y=0.98)
# Subplot A: CWE Coverage Percentage
bars1 = ax1.bar(summary_df['Vendor'], summary_df['Coverage_Percent'],
color=summary_df['Color'], alpha=0.8, edgecolor='black', linewidth=1)
ax1.set_ylabel('CWE Coverage Percentage (%)', fontsize=12, fontweight='bold')
ax1.set_title('(A) CWE Diversity by Vendor', fontsize=13, fontweight='bold')
ax1.grid(axis='y', alpha=0.3, linestyle='--')
ax1.set_ylim(0, max(summary_df['Coverage_Percent']) * 1.1)
# Add value labels
for bar, value in zip(bars1, summary_df['Coverage_Percent']):
height = bar.get_height()
ax1.text(bar.get_x() + bar.get_width()/2., height + 0.5,
f'{value:.1f}%', ha='center', va='bottom', fontweight='bold', fontsize=10)
# Rotate x-axis labels
ax1.tick_params(axis='x', rotation=45, labelsize=10)
ax1.tick_params(axis='y', labelsize=10)
# Subplot B: Specialization vs Coverage Scatter
scatter = ax2.scatter(summary_df['Coverage_Percent'], summary_df['Specialization_Percent'],
c=summary_df['Color'], s=summary_df['Total_CVEs']/50,
alpha=0.7, edgecolors='black', linewidth=1)
ax2.set_xlabel('CWE Coverage Percentage (%)', fontsize=12, fontweight='bold')
ax2.set_ylabel('Specialization Percentage (%)', fontsize=12, fontweight='bold')
ax2.set_title('(B) Specialization vs. Coverage Analysis', fontsize=13, fontweight='bold')
ax2.grid(alpha=0.3, linestyle='--')
# Add vendor labels
for _, row in summary_df.iterrows():
ax2.annotate(row['Vendor'],
(row['Coverage_Percent'], row['Specialization_Percent']),
xytext=(5, 5), textcoords='offset points', fontsize=9,
bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.7))
ax2.tick_params(axis='both', labelsize=10)
# Subplot C: Total CVEs Distribution
bars3 = ax3.bar(summary_df['Vendor'], summary_df['Total_CVEs'],
color=summary_df['Color'], alpha=0.8, edgecolor='black', linewidth=1)
ax3.set_ylabel('Total CVEs', fontsize=12, fontweight='bold')
ax3.set_title('(C) Vulnerability Volume by Vendor', fontsize=13, fontweight='bold')
ax3.grid(axis='y', alpha=0.3, linestyle='--')
# Add value labels
for bar, value in zip(bars3, summary_df['Total_CVEs']):
height = bar.get_height()
ax3.text(bar.get_x() + bar.get_width()/2., height + max(summary_df['Total_CVEs']) * 0.01,
f'{int(value):,}', ha='center', va='bottom', fontweight='bold', fontsize=10)
ax3.tick_params(axis='x', rotation=45, labelsize=10)
ax3.tick_params(axis='y', labelsize=10)
# Subplot D: Unique CWEs Count
bars4 = ax4.bar(summary_df['Vendor'], summary_df['Unique_CWEs'],
color=summary_df['Color'], alpha=0.8, edgecolor='black', linewidth=1)
ax4.set_ylabel('Number of Unique CWEs', fontsize=12, fontweight='bold')
ax4.set_title('(D) CWE Diversity Count by Vendor', fontsize=13, fontweight='bold')
ax4.grid(axis='y', alpha=0.3, linestyle='--')
# Add value labels
for bar, value in zip(bars4, summary_df['Unique_CWEs']):
height = bar.get_height()
ax4.text(bar.get_x() + bar.get_width()/2., height + 5,
f'{int(value)}', ha='center', va='bottom', fontweight='bold', fontsize=10)
ax4.tick_params(axis='x', rotation=45, labelsize=10)
ax4.tick_params(axis='y', labelsize=10)
plt.tight_layout()
plt.savefig('Figure1_Vendor_CWE_Comprehensive_Analysis.png', dpi=300, bbox_inches='tight',
facecolor='white', edgecolor='none')
plt.savefig('Figure1_Vendor_CWE_Comprehensive_Analysis.eps', dpi=300, bbox_inches='tight',
facecolor='white', edgecolor='none')
plt.show()
# =================================================================
# FIGURE 2: Top CWEs Comparison Heatmap
# =================================================================
# Prepare data for heatmap - Top 15 CWEs across all vendors
all_cwes = set()
for vendor, info in vendors_data.items():
if not info['data'].empty:
all_cwes.update(info['data'].head(15)['cwe_id'].tolist())
# Create matrix for heatmap
heatmap_data = []
vendor_names = []
for vendor, info in vendors_data.items():
if not info['data'].empty:
vendor_names.append(vendor)
vendor_row = []
vendor_data_dict = dict(zip(info['data']['cwe_id'], info['data'][info['count_col']]))
for cwe in sorted(all_cwes):
vendor_row.append(vendor_data_dict.get(cwe, 0))
heatmap_data.append(vendor_row)
heatmap_df = pd.DataFrame(heatmap_data,
index=vendor_names,
columns=sorted(all_cwes))
# Select top 20 CWEs by total occurrence
cwe_totals = heatmap_df.sum(axis=0).sort_values(ascending=False)
top_cwes = cwe_totals.head(20).index.tolist()
heatmap_subset = heatmap_df[top_cwes]
# Create heatmap
fig2, ax = plt.subplots(figsize=(16, 8))
# Use log scale for better visualization
heatmap_log = np.log1p(heatmap_subset) # log(1+x) to handle zeros
sns.heatmap(heatmap_log, annot=heatmap_subset, fmt='g', cmap='YlOrRd',
ax=ax, cbar_kws={'label': 'CVE Count (Log Scale)'},
linewidths=0.5, linecolor='white')
ax.set_title('Figure 2: Vendor-Specific CWE Distribution Heatmap\n(Top 20 CWEs by Total Occurrence)',
fontsize=14, fontweight='bold', pad=20)
ax.set_xlabel('Common Weakness Enumeration (CWE) ID', fontsize=12, fontweight='bold')
ax.set_ylabel('Vendor/Platform', fontsize=12, fontweight='bold')
# Improve tick labels
ax.tick_params(axis='x', rotation=45, labelsize=10)
ax.tick_params(axis='y', rotation=0, labelsize=10)
plt.tight_layout()
plt.savefig('Figure2_CWE_Vendor_Heatmap.png', dpi=300, bbox_inches='tight',
facecolor='white', edgecolor='none')
plt.savefig('Figure2_CWE_Vendor_Heatmap.eps', dpi=300, bbox_inches='tight',
facecolor='white', edgecolor='none')
plt.show()
# =================================================================
# FIGURE 3: CWE Category Analysis
# =================================================================
# Define CWE categories for academic analysis
cwe_categories = {
'Memory Management': ['CWE-119', 'CWE-120', 'CWE-121', 'CWE-122', 'CWE-125', 'CWE-126',
'CWE-416', 'CWE-476', 'CWE-787', 'CWE-822', 'CWE-824'],
'Input Validation': ['CWE-20', 'CWE-79', 'CWE-89', 'CWE-94', 'CWE-190', 'CWE-22',
'CWE-78', 'CWE-77', 'CWE-91'],
'Access Control': ['CWE-264', 'CWE-269', 'CWE-284', 'CWE-285', 'CWE-287', 'CWE-352',
'CWE-862', 'CWE-863', 'CWE-918'],
'Information Disclosure': ['CWE-200', 'CWE-209', 'CWE-215', 'CWE-359', 'CWE-532'],
'Resource Management': ['CWE-399', 'CWE-400', 'CWE-401', 'CWE-770', 'CWE-771', 'CWE-362'],
'Cryptographic Issues': ['CWE-310', 'CWE-311', 'CWE-326', 'CWE-327', 'CWE-347'],
'Configuration': ['CWE-16', 'CWE-502', 'CWE-611', 'CWE-693', 'CWE-444']
}
# Calculate category distributions for each vendor
category_data = []
for vendor, info in vendors_data.items():
if not info['data'].empty:
vendor_categories = {cat: 0 for cat in cwe_categories.keys()}
vendor_categories['Other'] = 0
for _, row in info['data'].iterrows():
cwe_id = row['cwe_id']
count = row[info['count_col']]
categorized = False
for category, cwes in cwe_categories.items():
if cwe_id in cwes:
vendor_categories[category] += count
categorized = True
break
if not categorized:
vendor_categories['Other'] += count
for category, count in vendor_categories.items():
category_data.append({
'Vendor': vendor,
'Category': category,
'Count': count,
'Color': info['color']
})
category_df = pd.DataFrame(category_data)
# Create stacked bar chart
fig3, ax = plt.subplots(figsize=(14, 8))
# Pivot data for stacked bar chart
pivot_df = category_df.pivot(index='Vendor', columns='Category', values='Count').fillna(0)
# Define colors for categories
category_colors = {
'Memory Management': '#e41a1c',
'Input Validation': '#377eb8',
'Access Control': '#4daf4a',
'Information Disclosure': '#984ea3',
'Resource Management': '#ff7f00',
'Cryptographic Issues': '#ffff33',
'Configuration': '#a65628',
'Other': '#999999'
}
# Create stacked bars
bottom = np.zeros(len(pivot_df))
for category in pivot_df.columns:
if category in category_colors:
ax.bar(pivot_df.index, pivot_df[category], bottom=bottom,
label=category, color=category_colors[category], alpha=0.8,
edgecolor='white', linewidth=0.5)
bottom += pivot_df[category]
ax.set_title('Figure 3: Vulnerability Distribution by CWE Category\n(Stacked Bar Chart)',
fontsize=14, fontweight='bold', pad=20)
ax.set_xlabel('Vendor/Platform', fontsize=12, fontweight='bold')
ax.set_ylabel('Number of CVEs', fontsize=12, fontweight='bold')
# Improve legend
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10)
# Rotate x-axis labels
ax.tick_params(axis='x', rotation=45, labelsize=10)
ax.tick_params(axis='y', labelsize=10)
ax.grid(axis='y', alpha=0.3, linestyle='--')
plt.tight_layout()
plt.savefig('Figure3_CWE_Category_Distribution.png', dpi=300, bbox_inches='tight',
facecolor='white', edgecolor='none')
plt.savefig('Figure3_CWE_Category_Distribution.eps', dpi=300, bbox_inches='tight',
facecolor='white', edgecolor='none')
plt.show()
# =================================================================
# FIGURE 4: Vendor Specialization Analysis (Radar Chart)
# =================================================================
# Calculate specialization metrics for each vendor
specialization_metrics = []
for vendor, info in vendors_data.items():
if not info['data'].empty:
data = info['data'].head(10) # Top 10 CWEs
# Calculate various specialization metrics
total_top10 = data[info['count_col']].sum()
top1_ratio = data.iloc[0][info['count_col']] / total_top10 * 100
top3_ratio = data.head(3)[info['count_col']].sum() / total_top10 * 100
top5_ratio = data.head(5)[info['count_col']].sum() / total_top10 * 100
# Gini coefficient for inequality
counts = data[info['count_col']].values
gini = 2 * np.sum(np.arange(1, len(counts) + 1) * np.sort(counts)) / (len(counts) * np.sum(counts)) - (len(counts) + 1) / len(counts)
gini_normalized = gini * 100
specialization_metrics.append({
'Vendor': vendor,
'Top_1_Dominance': top1_ratio,
'Top_3_Concentration': top3_ratio,
'Top_5_Concentration': top5_ratio,
'Inequality_Index': gini_normalized,
'Color': info['color']
})
spec_df = pd.DataFrame(specialization_metrics)
# Create radar chart
fig4, ax = plt.subplots(figsize=(12, 12), subplot_kw=dict(projection='polar'))
# Set up radar chart
categories = ['Top 1\nDominance (%)', 'Top 3\nConcentration (%)',
'Top 5\nConcentration (%)', 'Inequality\nIndex (%)']
N = len(categories)
# Compute angle for each category
angles = [n / float(N) * 2 * np.pi for n in range(N)]
angles += angles[:1] # Complete the circle
# Plot each vendor
for _, row in spec_df.iterrows():
values = [row['Top_1_Dominance'], row['Top_3_Concentration'],
row['Top_5_Concentration'], row['Inequality_Index']]
values += values[:1] # Complete the circle
ax.plot(angles, values, 'o-', linewidth=2, label=row['Vendor'],
color=row['Color'], alpha=0.8, markersize=6)
ax.fill(angles, values, alpha=0.1, color=row['Color'])
# Customize radar chart
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories, fontsize=11, fontweight='bold')
ax.set_ylim(0, 100)
ax.set_yticks([20, 40, 60, 80, 100])
ax.set_yticklabels(['20%', '40%', '60%', '80%', '100%'], fontsize=10)
ax.grid(True, alpha=0.3)
# Add title and legend
plt.title('Figure 4: Vendor Vulnerability Specialization Profile\n(Radar Chart Analysis)',
fontsize=14, fontweight='bold', pad=30)
plt.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0), fontsize=10)
plt.tight_layout()
plt.savefig('Figure4_Vendor_Specialization_Radar.png', dpi=300, bbox_inches='tight',
facecolor='white', edgecolor='none')
plt.savefig('Figure4_Vendor_Specialization_Radar.eps', dpi=300, bbox_inches='tight',
facecolor='white', edgecolor='none')
plt.show()
# =================================================================
# FIGURE 5: Statistical Analysis and Correlations
# =================================================================
fig5, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
fig5.suptitle('Figure 5: Statistical Analysis of Vendor CWE Patterns',
fontsize=16, fontweight='bold', y=0.98)
# Subplot A: Coverage vs Specialization Correlation
x = summary_df['Coverage_Percent']
y = summary_df['Specialization_Percent']
# Calculate correlation
correlation = np.corrcoef(x, y)[0, 1]
ax1.scatter(x, y, c=summary_df['Color'], s=100, alpha=0.8, edgecolors='black')
# Add trend line
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
ax1.plot(x, p(x), "r--", alpha=0.8, linewidth=2)
ax1.set_xlabel('CWE Coverage Percentage (%)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Specialization Percentage (%)', fontsize=12, fontweight='bold')
ax1.set_title(f'(A) Coverage vs Specialization\n(r = {correlation:.3f})',
fontsize=13, fontweight='bold')
ax1.grid(alpha=0.3, linestyle='--')
# Add vendor labels
for _, row in summary_df.iterrows():
ax1.annotate(row['Vendor'], (row['Coverage_Percent'], row['Specialization_Percent']),
xytext=(5, 5), textcoords='offset points', fontsize=9)
# Subplot B: CVE Volume Distribution
# --- Subplot B: CVE Volume Distribution (Histogram with Vendor Labels) ---
n_bins = 10 # Adjust bin count as needed
counts, bins, patches = ax2.hist(
summary_df['Total_CVEs'],
bins=n_bins,
color='skyblue',
edgecolor='black',
alpha=0.7
)
# Label vendors in each bin with vertical stacking
for i, (bin_start, bin_end) in enumerate(zip(bins[:-1], bins[1:])):
vendors_in_bin = summary_df[
(summary_df['Total_CVEs'] >= bin_start) &
(summary_df['Total_CVEs'] < bin_end)
]['Vendor'].tolist()
# Fix the upper bound issue: include right edge for last bin
if i == len(bins) - 2:
vendors_in_bin += summary_df[
(summary_df['Total_CVEs'] == bin_end)
]['Vendor'].tolist()
# Place each vendor label slightly above the bar with vertical stacking
for j, vendor in enumerate(vendors_in_bin):
ax2.text(
(bin_start + bin_end) / 2, # x: center of the bin
counts[i] + 0.2 + (j * 0.3), # y: staggered vertically
vendor,
ha='center',
va='bottom',
fontsize=8,
bbox=dict(boxstyle="round,pad=0.2", edgecolor='gray', facecolor='white', alpha=0.6)
)
ax2.set_xlabel('Total CVEs', fontsize=12, fontweight='bold')
ax2.set_ylabel('Frequency (Vendor Count)', fontsize=12, fontweight='bold')
ax2.set_title('(B) CVE Volume Distribution', fontsize=13, fontweight='bold')
ax2.grid(alpha=0.3, linestyle='--')
# Subplot C: CWE Diversity vs CVE Volume
ax3.scatter(summary_df['Total_CVEs'], summary_df['Unique_CWEs'],
c=summary_df['Color'], s=100, alpha=0.8, edgecolors='black')
# Add trend line
x_vol = summary_df['Total_CVEs']
y_div = summary_df['Unique_CWEs']
correlation2 = np.corrcoef(x_vol, y_div)[0, 1]
z2 = np.polyfit(x_vol, y_div, 1)
p2 = np.poly1d(z2)
ax3.plot(x_vol, p2(x_vol), "r--", alpha=0.8, linewidth=2)
ax3.set_xlabel('Total CVEs', fontsize=12, fontweight='bold')
ax3.set_ylabel('Unique CWEs', fontsize=12, fontweight='bold')
ax3.set_title(f'(C) Volume vs Diversity\n(r = {correlation2:.3f})',
fontsize=13, fontweight='bold')
ax3.grid(alpha=0.3, linestyle='--')
# Add vendor labels
for _, row in summary_df.iterrows():
ax3.annotate(row['Vendor'], (row['Total_CVEs'], row['Unique_CWEs']),
xytext=(5, 5), textcoords='offset points', fontsize=9)
# Subplot D: Vendor Ranking Matrix
metrics = ['Coverage_Percent', 'Total_CVEs', 'Specialization_Percent']
metric_names = ['Coverage %', 'Total CVEs', 'Specialization %']
# Normalize metrics to 0-1 scale for comparison
normalized_data = summary_df[metrics].copy()
for col in metrics:
normalized_data[col] = (normalized_data[col] - normalized_data[col].min()) / (normalized_data[col].max() - normalized_data[col].min())
# Create ranking heatmap
im = ax4.imshow(normalized_data.T, cmap='RdYlGn', aspect='auto', alpha=0.8)
# Set ticks and labels
ax4.set_xticks(range(len(summary_df)))
ax4.set_xticklabels(summary_df['Vendor'], rotation=45, ha='right')
ax4.set_yticks(range(len(metric_names)))
ax4.set_yticklabels(metric_names)
# Add text annotations
for i in range(len(metric_names)):
for j in range(len(summary_df)):
text = ax4.text(j, i, f'{normalized_data.iloc[j, i]:.2f}',
ha="center", va="center", color="black", fontweight='bold')
ax4.set_title('(D) Normalized Metric Comparison', fontsize=13, fontweight='bold')
# Add colorbar
cbar = plt.colorbar(im, ax=ax4, shrink=0.8)
cbar.set_label('Normalized Score (0-1)', fontsize=10, fontweight='bold')
plt.tight_layout()
plt.savefig('Figure5_Statistical_Analysis.png', dpi=300, bbox_inches='tight',
facecolor='white', edgecolor='none')
plt.savefig('Figure5_Statistical_Analysis.eps', dpi=300, bbox_inches='tight',
facecolor='white', edgecolor='none')
plt.show()
# =================================================================
# Print Summary Statistics for Academic Paper
# =================================================================
print("\n" + "="*80)
print("ACADEMIC SUMMARY STATISTICS")
print("="*80)
print(f"Dataset Overview:")
print(f"• Total unique CWEs analyzed: {vendor_results['total_cwes']}")
print(f"• Number of vendors/platforms: {len(summary_df)}")
print(f"• Total CVEs across all vendors: {summary_df['Total_CVEs'].sum():,}")
print(f"\nCoverage Statistics:")
print(f"• Mean CWE coverage: {summary_df['Coverage_Percent'].mean():.1f}% (SD: {summary_df['Coverage_Percent'].std():.1f}%)")
print(f"• Highest coverage: {summary_df['Coverage_Percent'].max():.1f}% ({summary_df.loc[summary_df['Coverage_Percent'].idxmax(), 'Vendor']})")
print(f"• Lowest coverage: {summary_df['Coverage_Percent'].min():.1f}% ({summary_df.loc[summary_df['Coverage_Percent'].idxmin(), 'Vendor']})")
print(f"\nSpecialization Statistics:")
print(f"• Mean specialization: {summary_df['Specialization_Percent'].mean():.1f}% (SD: {summary_df['Specialization_Percent'].std():.1f}%)")
print(f"• Most specialized: {summary_df['Specialization_Percent'].max():.1f}% ({summary_df.loc[summary_df['Specialization_Percent'].idxmax(), 'Vendor']})")
print(f"• Least specialized: {summary_df['Specialization_Percent'].min():.1f}% ({summary_df.loc[summary_df['Specialization_Percent'].idxmin(), 'Vendor']})")
print(f"\nCorrelation Analysis:")
print(f"• Coverage vs Specialization: r = {correlation:.3f}")
print(f"• Volume vs Diversity: r = {correlation2:.3f}")
print("\n" + "="*80)
print("GENERATED ACADEMIC FIGURES:")
print("="*80)
print("• Figure1_Vendor_CWE_Comprehensive_Analysis.png/.eps")
print("• Figure2_CWE_Vendor_Heatmap.png/.eps")
print("• Figure3_CWE_Category_Distribution.png/.eps")
print("• Figure4_Vendor_Specialization_Radar.png/.eps")
print("• Figure5_Statistical_Analysis.png/.eps")
print("\nAll figures saved in both PNG (presentation) and EPS (publication) formats.")
return {
'summary_stats': summary_df,
'category_analysis': category_df,
'specialization_metrics': spec_df,
'correlations': {'coverage_specialization': correlation, 'volume_diversity': correlation2}
}
# Execute visualization creation
# Note: This function should be called with the vendor_results from the previous analysis
print("Academic CWE Visualization Suite Ready!")
print("Usage: visualization_results = create_academic_cwe_visualizations(corrected_vendor_results)")
print("\nThis will generate 5 publication-quality figures suitable for academic papers.")
# After running the corrected vendor analysis:
visualization_results = create_academic_cwe_visualizations(fixed_results)
# This generates all 5 figures and returns statistical summary
# Set academic style plt.style.use('default') sns.set_palette("husl") def create_academic_cwe_visualizations(vendor_results): """ Create publication-quality academic visualizations for CWE analysis """ print("Creating Academic-Quality CWE Visualizations...") # Prepare data from vendor results vendors_data = { 'Microsoft': {'data': vendor_results['microsoft'], 'count_col': 'ms_count', 'color': '#1f77b4'}, 'Cisco': {'data': vendor_results['cisco'], 'count_col': 'cisco_count', 'color': '#ff7f0e'}, 'RedHat Commercial': {'data': vendor_results['redhat_commercial'], 'count_col': 'rh_comm_count', 'color': '#2ca02c'}, 'RedHat Open-Source': {'data': vendor_results['redhat_opensource'], 'count_col': 'rh_os_count', 'color': '#d62728'}, 'GitHub Open-Source': {'data': vendor_results['github'], 'count_col': 'gh_count', 'color': '#9467bd'} } # Create summary statistics summary_stats = [] for vendor, info in vendors_data.items(): if not info['data'].empty: total_cves = info['data'][info['count_col']].sum() unique_cwes = len(info['data']) top_cwe = info['data'].iloc[0]['cwe_id'] top_count = info['data'].iloc[0][info['count_col']] specialization = (top_count / total_cves) * 100 coverage = (unique_cwes / vendor_results['total_cwes']) * 100 summary_stats.append({ 'Vendor': vendor, 'Total_CVEs': total_cves, 'Unique_CWEs': unique_cwes, 'Coverage_Percent': coverage, 'Specialization_Percent': specialization, 'Top_CWE': top_cwe, 'Top_Count': top_count, 'Color': info['color'] }) summary_df = pd.DataFrame(summary_stats) # ================================================================= # FIGURE 1: Comprehensive Vendor Comparison (2x2 Grid) # ================================================================= fig1, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12)) fig1.suptitle('Vendor-Specific Common Weakness Enumeration (CWE) Analysis', fontsize=16, fontweight='bold', y=0.98) # Subplot A: CWE Coverage Percentage bars1 = ax1.bar(summary_df['Vendor'], summary_df['Coverage_Percent'], color=summary_df['Color'], alpha=0.8, edgecolor='black', linewidth=1) ax1.set_ylabel('CWE Coverage Percentage (%)', fontsize=12, fontweight='bold') ax1.set_title('(A) CWE Diversity by Vendor', fontsize=13, fontweight='bold') ax1.grid(axis='y', alpha=0.3, linestyle='--') ax1.set_ylim(0, max(summary_df['Coverage_Percent']) * 1.1) # Add value labels for bar, value in zip(bars1, summary_df['Coverage_Percent']): height = bar.get_height() ax1.text(bar.get_x() + bar.get_width()/2., height + 0.5, f'{value:.1f}%', ha='center', va='bottom', fontweight='bold', fontsize=10) # Rotate x-axis labels ax1.tick_params(axis='x', rotation=45, labelsize=10) ax1.tick_params(axis='y', labelsize=10) # Subplot B: Specialization vs Coverage Scatter scatter = ax2.scatter(summary_df['Coverage_Percent'], summary_df['Specialization_Percent'], c=summary_df['Color'], s=summary_df['Total_CVEs']/50, alpha=0.7, edgecolors='black', linewidth=1) ax2.set_xlabel('CWE Coverage Percentage (%)', fontsize=12, fontweight='bold') ax2.set_ylabel('Specialization Percentage (%)', fontsize=12, fontweight='bold') ax2.set_title('(B) Specialization vs. Coverage Analysis', fontsize=13, fontweight='bold') ax2.grid(alpha=0.3, linestyle='--') # Add vendor labels for _, row in summary_df.iterrows(): ax2.annotate(row['Vendor'], (row['Coverage_Percent'], row['Specialization_Percent']), xytext=(5, 5), textcoords='offset points', fontsize=9, bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.7)) ax2.tick_params(axis='both', labelsize=10) # Subplot C: Total CVEs Distribution bars3 = ax3.bar(summary_df['Vendor'], summary_df['Total_CVEs'], color=summary_df['Color'], alpha=0.8, edgecolor='black', linewidth=1) ax3.set_ylabel('Total CVEs', fontsize=12, fontweight='bold') ax3.set_title('(C) Vulnerability Volume by Vendor', fontsize=13, fontweight='bold') ax3.grid(axis='y', alpha=0.3, linestyle='--') # Add value labels for bar, value in zip(bars3, summary_df['Total_CVEs']): height = bar.get_height() ax3.text(bar.get_x() + bar.get_width()/2., height + max(summary_df['Total_CVEs']) * 0.01, f'{int(value):,}', ha='center', va='bottom', fontweight='bold', fontsize=10) ax3.tick_params(axis='x', rotation=45, labelsize=10) ax3.tick_params(axis='y', labelsize=10) # Subplot D: Unique CWEs Count bars4 = ax4.bar(summary_df['Vendor'], summary_df['Unique_CWEs'], color=summary_df['Color'], alpha=0.8, edgecolor='black', linewidth=1) ax4.set_ylabel('Number of Unique CWEs', fontsize=12, fontweight='bold') ax4.set_title('(D) CWE Diversity Count by Vendor', fontsize=13, fontweight='bold') ax4.grid(axis='y', alpha=0.3, linestyle='--') # Add value labels for bar, value in zip(bars4, summary_df['Unique_CWEs']): height = bar.get_height() ax4.text(bar.get_x() + bar.get_width()/2., height + 5, f'{int(value)}', ha='center', va='bottom', fontweight='bold', fontsize=10) ax4.tick_params(axis='x', rotation=45, labelsize=10) ax4.tick_params(axis='y', labelsize=10) plt.tight_layout() plt.savefig('Figure1_Vendor_CWE_Comprehensive_Analysis.png', dpi=300, bbox_inches='tight', facecolor='white', edgecolor='none') plt.savefig('Figure1_Vendor_CWE_Comprehensive_Analysis.eps', dpi=300, bbox_inches='tight', facecolor='white', edgecolor='none') plt.show() # ================================================================= # FIGURE 2: Top CWEs Comparison Heatmap # ================================================================= # Prepare data for heatmap - Top 15 CWEs across all vendors all_cwes = set() for vendor, info in vendors_data.items(): if not info['data'].empty: all_cwes.update(info['data'].head(15)['cwe_id'].tolist()) # Create matrix for heatmap heatmap_data = [] vendor_names = [] for vendor, info in vendors_data.items(): if not info['data'].empty: vendor_names.append(vendor) vendor_row = [] vendor_data_dict = dict(zip(info['data']['cwe_id'], info['data'][info['count_col']])) for cwe in sorted(all_cwes): vendor_row.append(vendor_data_dict.get(cwe, 0)) heatmap_data.append(vendor_row) heatmap_df = pd.DataFrame(heatmap_data, index=vendor_names, columns=sorted(all_cwes)) # Select top 20 CWEs by total occurrence cwe_totals = heatmap_df.sum(axis=0).sort_values(ascending=False) top_cwes = cwe_totals.head(20).index.tolist() heatmap_subset = heatmap_df[top_cwes] # Create heatmap fig2, ax = plt.subplots(figsize=(16, 8)) # Use log scale for better visualization heatmap_log = np.log1p(heatmap_subset) # log(1+x) to handle zeros sns.heatmap(heatmap_log, annot=heatmap_subset, fmt='g', cmap='YlOrRd', ax=ax, cbar_kws={'label': 'CVE Count (Log Scale)'}, linewidths=0.5, linecolor='white') ax.set_title('Figure 2: Vendor-Specific CWE Distribution Heatmap\n(Top 20 CWEs by Total Occurrence)', fontsize=14, fontweight='bold', pad=20) ax.set_xlabel('Common Weakness Enumeration (CWE) ID', fontsize=12, fontweight='bold') ax.set_ylabel('Vendor/Platform', fontsize=12, fontweight='bold') # Improve tick labels ax.tick_params(axis='x', rotation=45, labelsize=10) ax.tick_params(axis='y', rotation=0, labelsize=10) plt.tight_layout() plt.savefig('Figure2_CWE_Vendor_Heatmap.png', dpi=300, bbox_inches='tight', facecolor='white', edgecolor='none') plt.savefig('Figure2_CWE_Vendor_Heatmap.eps', dpi=300, bbox_inches='tight', facecolor='white', edgecolor='none') plt.show() # ================================================================= # FIGURE 3: CWE Category Analysis # ================================================================= # Define CWE categories for academic analysis cwe_categories = { 'Memory Management': ['CWE-119', 'CWE-120', 'CWE-121', 'CWE-122', 'CWE-125', 'CWE-126', 'CWE-416', 'CWE-476', 'CWE-787', 'CWE-822', 'CWE-824'], 'Input Validation': ['CWE-20', 'CWE-79', 'CWE-89', 'CWE-94', 'CWE-190', 'CWE-22', 'CWE-78', 'CWE-77', 'CWE-91'], 'Access Control': ['CWE-264', 'CWE-269', 'CWE-284', 'CWE-285', 'CWE-287', 'CWE-352', 'CWE-862', 'CWE-863', 'CWE-918'], 'Information Disclosure': ['CWE-200', 'CWE-209', 'CWE-215', 'CWE-359', 'CWE-532'], 'Resource Management': ['CWE-399', 'CWE-400', 'CWE-401', 'CWE-770', 'CWE-771', 'CWE-362'], 'Cryptographic Issues': ['CWE-310', 'CWE-311', 'CWE-326', 'CWE-327', 'CWE-347'], 'Configuration': ['CWE-16', 'CWE-502', 'CWE-611', 'CWE-693', 'CWE-444'] } # Calculate category distributions for each vendor category_data = [] for vendor, info in vendors_data.items(): if not info['data'].empty: vendor_categories = {cat: 0 for cat in cwe_categories.keys()} vendor_categories['Other'] = 0 for _, row in info['data'].iterrows(): cwe_id = row['cwe_id'] count = row[info['count_col']] categorized = False for category, cwes in cwe_categories.items(): if cwe_id in cwes: vendor_categories[category] += count categorized = True break if not categorized: vendor_categories['Other'] += count for category, count in vendor_categories.items(): category_data.append({ 'Vendor': vendor, 'Category': category, 'Count': count, 'Color': info['color'] }) category_df = pd.DataFrame(category_data) # Create stacked bar chart fig3, ax = plt.subplots(figsize=(14, 8)) # Pivot data for stacked bar chart pivot_df = category_df.pivot(index='Vendor', columns='Category', values='Count').fillna(0) # Define colors for categories category_colors = { 'Memory Management': '#e41a1c', 'Input Validation': '#377eb8', 'Access Control': '#4daf4a', 'Information Disclosure': '#984ea3', 'Resource Management': '#ff7f00', 'Cryptographic Issues': '#ffff33', 'Configuration': '#a65628', 'Other': '#999999' } # Create stacked bars bottom = np.zeros(len(pivot_df)) for category in pivot_df.columns: if category in category_colors: ax.bar(pivot_df.index, pivot_df[category], bottom=bottom, label=category, color=category_colors[category], alpha=0.8, edgecolor='white', linewidth=0.5) bottom += pivot_df[category] ax.set_title('Figure 3: Vulnerability Distribution by CWE Category\n(Stacked Bar Chart)', fontsize=14, fontweight='bold', pad=20) ax.set_xlabel('Vendor/Platform', fontsize=12, fontweight='bold') ax.set_ylabel('Number of CVEs', fontsize=12, fontweight='bold') # Improve legend ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10) # Rotate x-axis labels ax.tick_params(axis='x', rotation=45, labelsize=10) ax.tick_params(axis='y', labelsize=10) ax.grid(axis='y', alpha=0.3, linestyle='--') plt.tight_layout() plt.savefig('Figure3_CWE_Category_Distribution.png', dpi=300, bbox_inches='tight', facecolor='white', edgecolor='none') plt.savefig('Figure3_CWE_Category_Distribution.eps', dpi=300, bbox_inches='tight', facecolor='white', edgecolor='none') plt.show() # ================================================================= # FIGURE 4: Vendor Specialization Analysis (Radar Chart) # ================================================================= # Calculate specialization metrics for each vendor specialization_metrics = [] for vendor, info in vendors_data.items(): if not info['data'].empty: data = info['data'].head(10) # Top 10 CWEs # Calculate various specialization metrics total_top10 = data[info['count_col']].sum() top1_ratio = data.iloc[0][info['count_col']] / total_top10 * 100 top3_ratio = data.head(3)[info['count_col']].sum() / total_top10 * 100 top5_ratio = data.head(5)[info['count_col']].sum() / total_top10 * 100 # Gini coefficient for inequality counts = data[info['count_col']].values gini = 2 * np.sum(np.arange(1, len(counts) + 1) * np.sort(counts)) / (len(counts) * np.sum(counts)) - (len(counts) + 1) / len(counts) gini_normalized = gini * 100 specialization_metrics.append({ 'Vendor': vendor, 'Top_1_Dominance': top1_ratio, 'Top_3_Concentration': top3_ratio, 'Top_5_Concentration': top5_ratio, 'Inequality_Index': gini_normalized, 'Color': info['color'] }) spec_df = pd.DataFrame(specialization_metrics) # Create radar chart fig4, ax = plt.subplots(figsize=(12, 12), subplot_kw=dict(projection='polar')) # Set up radar chart categories = ['Top 1\nDominance (%)', 'Top 3\nConcentration (%)', 'Top 5\nConcentration (%)', 'Inequality\nIndex (%)'] N = len(categories) # Compute angle for each category angles = [n / float(N) * 2 * np.pi for n in range(N)] angles += angles[:1] # Complete the circle # Plot each vendor for _, row in spec_df.iterrows(): values = [row['Top_1_Dominance'], row['Top_3_Concentration'], row['Top_5_Concentration'], row['Inequality_Index']] values += values[:1] # Complete the circle ax.plot(angles, values, 'o-', linewidth=2, label=row['Vendor'], color=row['Color'], alpha=0.8, markersize=6) ax.fill(angles, values, alpha=0.1, color=row['Color']) # Customize radar chart ax.set_xticks(angles[:-1]) ax.set_xticklabels(categories, fontsize=11, fontweight='bold') ax.set_ylim(0, 100) ax.set_yticks([20, 40, 60, 80, 100]) ax.set_yticklabels(['20%', '40%', '60%', '80%', '100%'], fontsize=10) ax.grid(True, alpha=0.3) # Add title and legend plt.title('Figure 4: Vendor Vulnerability Specialization Profile\n(Radar Chart Analysis)', fontsize=14, fontweight='bold', pad=30) plt.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0), fontsize=10) plt.tight_layout() plt.savefig('Figure4_Vendor_Specialization_Radar.png', dpi=300, bbox_inches='tight', facecolor='white', edgecolor='none') plt.savefig('Figure4_Vendor_Specialization_Radar.eps', dpi=300, bbox_inches='tight', facecolor='white', edgecolor='none') plt.show() # ================================================================= # FIGURE 5: Statistical Analysis and Correlations # ================================================================= fig5, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12)) fig5.suptitle('Figure 5: Statistical Analysis of Vendor CWE Patterns', fontsize=16, fontweight='bold', y=0.98) # Subplot A: Coverage vs Specialization Correlation x = summary_df['Coverage_Percent'] y = summary_df['Specialization_Percent'] # Calculate correlation correlation = np.corrcoef(x, y)[0, 1] ax1.scatter(x, y, c=summary_df['Color'], s=100, alpha=0.8, edgecolors='black') # Add trend line z = np.polyfit(x, y, 1) p = np.poly1d(z) ax1.plot(x, p(x), "r--", alpha=0.8, linewidth=2) ax1.set_xlabel('CWE Coverage Percentage (%)', fontsize=12, fontweight='bold') ax1.set_ylabel('Specialization Percentage (%)', fontsize=12, fontweight='bold') ax1.set_title(f'(A) Coverage vs Specialization\n(r = {correlation:.3f})', fontsize=13, fontweight='bold') ax1.grid(alpha=0.3, linestyle='--') # Add vendor labels for _, row in summary_df.iterrows(): ax1.annotate(row['Vendor'], (row['Coverage_Percent'], row['Specialization_Percent']), xytext=(5, 5), textcoords='offset points', fontsize=9) # Subplot B: CVE Volume Distribution # --- Subplot B: CVE Volume Distribution (Histogram with Vendor Labels) --- n_bins = 10 # Adjust bin count as needed counts, bins, patches = ax2.hist( summary_df['Total_CVEs'], bins=n_bins, color='skyblue', edgecolor='black', alpha=0.7 ) # Label vendors in each bin with vertical stacking for i, (bin_start, bin_end) in enumerate(zip(bins[:-1], bins[1:])): vendors_in_bin = summary_df[ (summary_df['Total_CVEs'] >= bin_start) & (summary_df['Total_CVEs'] < bin_end) ]['Vendor'].tolist() # Fix the upper bound issue: include right edge for last bin if i == len(bins) - 2: vendors_in_bin += summary_df[ (summary_df['Total_CVEs'] == bin_end) ]['Vendor'].tolist() # Place each vendor label slightly above the bar with vertical stacking for j, vendor in enumerate(vendors_in_bin): ax2.text( (bin_start + bin_end) / 2, # x: center of the bin counts[i] + 0.2 + (j * 0.3), # y: staggered vertically vendor, ha='center', va='bottom', fontsize=8, bbox=dict(boxstyle="round,pad=0.2", edgecolor='gray', facecolor='white', alpha=0.6) ) ax2.set_xlabel('Total CVEs', fontsize=12, fontweight='bold') ax2.set_ylabel('Frequency (Vendor Count)', fontsize=12, fontweight='bold') ax2.set_title('(B) CVE Volume Distribution', fontsize=13, fontweight='bold') ax2.grid(alpha=0.3, linestyle='--') # Subplot C: CWE Diversity vs CVE Volume ax3.scatter(summary_df['Total_CVEs'], summary_df['Unique_CWEs'], c=summary_df['Color'], s=100, alpha=0.8, edgecolors='black') # Add trend line x_vol = summary_df['Total_CVEs'] y_div = summary_df['Unique_CWEs'] correlation2 = np.corrcoef(x_vol, y_div)[0, 1] z2 = np.polyfit(x_vol, y_div, 1) p2 = np.poly1d(z2) ax3.plot(x_vol, p2(x_vol), "r--", alpha=0.8, linewidth=2) ax3.set_xlabel('Total CVEs', fontsize=12, fontweight='bold') ax3.set_ylabel('Unique CWEs', fontsize=12, fontweight='bold') ax3.set_title(f'(C) Volume vs Diversity\n(r = {correlation2:.3f})', fontsize=13, fontweight='bold') ax3.grid(alpha=0.3, linestyle='--') # Add vendor labels for _, row in summary_df.iterrows(): ax3.annotate(row['Vendor'], (row['Total_CVEs'], row['Unique_CWEs']), xytext=(5, 5), textcoords='offset points', fontsize=9) # Subplot D: Vendor Ranking Matrix metrics = ['Coverage_Percent', 'Total_CVEs', 'Specialization_Percent'] metric_names = ['Coverage %', 'Total CVEs', 'Specialization %'] # Normalize metrics to 0-1 scale for comparison normalized_data = summary_df[metrics].copy() for col in metrics: normalized_data[col] = (normalized_data[col] - normalized_data[col].min()) / (normalized_data[col].max() - normalized_data[col].min()) # Create ranking heatmap im = ax4.imshow(normalized_data.T, cmap='RdYlGn', aspect='auto', alpha=0.8) # Set ticks and labels ax4.set_xticks(range(len(summary_df))) ax4.set_xticklabels(summary_df['Vendor'], rotation=45, ha='right') ax4.set_yticks(range(len(metric_names))) ax4.set_yticklabels(metric_names) # Add text annotations for i in range(len(metric_names)): for j in range(len(summary_df)): text = ax4.text(j, i, f'{normalized_data.iloc[j, i]:.2f}', ha="center", va="center", color="black", fontweight='bold') ax4.set_title('(D) Normalized Metric Comparison', fontsize=13, fontweight='bold') # Add colorbar cbar = plt.colorbar(im, ax=ax4, shrink=0.8) cbar.set_label('Normalized Score (0-1)', fontsize=10, fontweight='bold') plt.tight_layout() plt.savefig('Figure5_Statistical_Analysis.png', dpi=300, bbox_inches='tight', facecolor='white', edgecolor='none') plt.savefig('Figure5_Statistical_Analysis.eps', dpi=300, bbox_inches='tight', facecolor='white', edgecolor='none') plt.show() # ================================================================= # Print Summary Statistics for Academic Paper # ================================================================= print("\n" + "="*80) print("ACADEMIC SUMMARY STATISTICS") print("="*80) print(f"Dataset Overview:") print(f"• Total unique CWEs analyzed: {vendor_results['total_cwes']}") print(f"• Number of vendors/platforms: {len(summary_df)}") print(f"• Total CVEs across all vendors: {summary_df['Total_CVEs'].sum():,}") print(f"\nCoverage Statistics:") print(f"• Mean CWE coverage: {summary_df['Coverage_Percent'].mean():.1f}% (SD: {summary_df['Coverage_Percent'].std():.1f}%)") print(f"• Highest coverage: {summary_df['Coverage_Percent'].max():.1f}% ({summary_df.loc[summary_df['Coverage_Percent'].idxmax(), 'Vendor']})") print(f"• Lowest coverage: {summary_df['Coverage_Percent'].min():.1f}% ({summary_df.loc[summary_df['Coverage_Percent'].idxmin(), 'Vendor']})") print(f"\nSpecialization Statistics:") print(f"• Mean specialization: {summary_df['Specialization_Percent'].mean():.1f}% (SD: {summary_df['Specialization_Percent'].std():.1f}%)") print(f"• Most specialized: {summary_df['Specialization_Percent'].max():.1f}% ({summary_df.loc[summary_df['Specialization_Percent'].idxmax(), 'Vendor']})") print(f"• Least specialized: {summary_df['Specialization_Percent'].min():.1f}% ({summary_df.loc[summary_df['Specialization_Percent'].idxmin(), 'Vendor']})") print(f"\nCorrelation Analysis:") print(f"• Coverage vs Specialization: r = {correlation:.3f}") print(f"• Volume vs Diversity: r = {correlation2:.3f}") print("\n" + "="*80) print("GENERATED ACADEMIC FIGURES:") print("="*80) print("• Figure1_Vendor_CWE_Comprehensive_Analysis.png/.eps") print("• Figure2_CWE_Vendor_Heatmap.png/.eps") print("• Figure3_CWE_Category_Distribution.png/.eps") print("• Figure4_Vendor_Specialization_Radar.png/.eps") print("• Figure5_Statistical_Analysis.png/.eps") print("\nAll figures saved in both PNG (presentation) and EPS (publication) formats.") return { 'summary_stats': summary_df, 'category_analysis': category_df, 'specialization_metrics': spec_df, 'correlations': {'coverage_specialization': correlation, 'volume_diversity': correlation2} } # Execute visualization creation # Note: This function should be called with the vendor_results from the previous analysis print("Academic CWE Visualization Suite Ready!") print("Usage: visualization_results = create_academic_cwe_visualizations(corrected_vendor_results)") print("\nThis will generate 5 publication-quality figures suitable for academic papers.") # After running the corrected vendor analysis: visualization_results = create_academic_cwe_visualizations(fixed_results) # This generates all 5 figures and returns statistical summary
Academic CWE Visualization Suite Ready! Usage: visualization_results = create_academic_cwe_visualizations(corrected_vendor_results) This will generate 5 publication-quality figures suitable for academic papers. Creating Academic-Quality CWE Visualizations...
The PostScript backend does not support transparency; partially transparent artists will be rendered opaque.
The PostScript backend does not support transparency; partially transparent artists will be rendered opaque.
The PostScript backend does not support transparency; partially transparent artists will be rendered opaque.
The PostScript backend does not support transparency; partially transparent artists will be rendered opaque.
================================================================================ ACADEMIC SUMMARY STATISTICS ================================================================================ Dataset Overview: • Total unique CWEs analyzed: 733 • Number of vendors/platforms: 5 • Total CVEs across all vendors: 50,270 Coverage Statistics: • Mean CWE coverage: 42.9% (SD: 12.7%) • Highest coverage: 60.6% (GitHub Open-Source) • Lowest coverage: 27.7% (RedHat Commercial) Specialization Statistics: • Mean specialization: 12.4% (SD: 4.0%) • Most specialized: 16.8% (Cisco) • Least specialized: 7.2% (RedHat Commercial) Correlation Analysis: • Coverage vs Specialization: r = 0.444 • Volume vs Diversity: r = 0.958 ================================================================================ GENERATED ACADEMIC FIGURES: ================================================================================ • Figure1_Vendor_CWE_Comprehensive_Analysis.png/.eps • Figure2_CWE_Vendor_Heatmap.png/.eps • Figure3_CWE_Category_Distribution.png/.eps • Figure4_Vendor_Specialization_Radar.png/.eps • Figure5_Statistical_Analysis.png/.eps All figures saved in both PNG (presentation) and EPS (publication) formats.