Reproduction Guide¶
This guide provides step-by-step instructions for reproducing the vendor-specific CWE analysis research. The analysis can be performed using either parquet files or a DuckDB database file.
Reproduction Overview
Goal: Reproduce the comprehensive analysis of 50,270 CVEs across 733 unique CWEs for five major technology vendors.
Expected Time: 2-4 hours (depending on data source and system performance)
Skill Level: Intermediate Python, basic SQL knowledge recommended
🖥️ System Requirements¶
Minimum Requirements¶
Hardware Specifications
- RAM: 8GB minimum, 16GB recommended
- Storage: 10GB available space for data and outputs
- CPU: Multi-core processor recommended for large dataset processing
- OS: Windows 10+, macOS 10.14+, or Linux (Ubuntu 18.04+)
Software Dependencies¶
Required Software
- Python 3.8+ with pip package manager
- Jupyter Lab/Notebook for interactive analysis
- Git for repository management (optional)
📥 Data Setup Options¶
You have two options for accessing the research data: the individual parquet files described below, or a single pre-built DuckDB database file (both are covered in the Data Acquisition section).
Using Parquet Files
This is the original research approach, offering maximum flexibility and transparency.
Advantages:

- Full transparency of data processing steps
- Ability to modify queries and analysis
- Individual table access for targeted analysis
- Compatibility with various analytics tools
Data Files Required:
parquet_data/
├── mysql_cve.parquet # Main CVE data
├── mysql_exploit.parquet # Exploit information
├── mysql_msrc_vuln_unified.parquet # Microsoft patches
├── mysql_cisco_vuln_unified.parquet # Cisco advisories
├── mysql_redhat_vuln_unified.parquet # RedHat patches
├── mysql_github_advisory_unified.parquet # GitHub advisories
├── mysql_cwe.parquet # CWE reference data
└── mysql_capec.parquet # CAPEC reference data
🛠️ Environment Setup¶
Step 1: Create Python Environment¶
Virtual Environment Setup
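No specific tool is mandated here; the following is a minimal sketch using Python's built-in `venv` module (substitute conda or another environment manager if preferred):

```bash
# Create and activate an isolated environment (Linux/macOS)
python3 -m venv cwe-analysis-env
source cwe-analysis-env/bin/activate

# On Windows (PowerShell), activate with:
# .\cwe-analysis-env\Scripts\Activate.ps1

# Upgrade pip inside the fresh environment
pip install --upgrade pip
```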
Step 2: Install Dependencies¶
Package Installation
# Core analytics packages
# Core analytics packages
# (quotes prevent the shell from treating >= as a redirect)
pip install "duckdb>=0.9.0"
pip install "pandas>=1.5.0"
pip install "numpy>=1.24.0"

# Visualization packages
pip install "matplotlib>=3.6.0"
pip install "seaborn>=0.12.0"
pip install "plotly>=5.15.0"

# Jupyter environment
pip install "jupyterlab>=4.0.0"
pip install "notebook>=7.0.0"
pip install "ipywidgets>=8.0.0"

# Statistical analysis
pip install "scipy>=1.11.0"
pip install "scikit-learn>=1.3.0"

# Optional: performance enhancement
pip install "modin[all]"  # faster drop-in pandas operations
Step 3: Verify Installation¶
Installation Verification
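A minimal verification sketch; it simply confirms that the core packages import and prints their versions:

```python
# Confirm the core packages import and report their versions
import duckdb, pandas, numpy, matplotlib, seaborn, plotly, scipy, sklearn

for module in (duckdb, pandas, numpy, matplotlib, seaborn, plotly, scipy, sklearn):
    print(f"{module.__name__}: {module.__version__}")

print("✅ All core packages imported successfully")
```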
📊 Data Acquisition¶
Option 1: Download Parquet Files¶
Parquet Files Download
# Create data directory
mkdir -p parquet_data
cd parquet_data
# Download individual parquet files
# (Replace URLs with actual data source locations)
wget -O mysql_cve.parquet [CVE_DATA_URL]
wget -O mysql_exploit.parquet [EXPLOIT_DATA_URL]
wget -O mysql_msrc_vuln_unified.parquet [MSRC_DATA_URL]
wget -O mysql_cisco_vuln_unified.parquet [CISCO_DATA_URL]
wget -O mysql_redhat_vuln_unified.parquet [REDHAT_DATA_URL]
wget -O mysql_github_advisory_unified.parquet [GITHUB_DATA_URL]
wget -O mysql_cwe.parquet [CWE_REF_URL]
wget -O mysql_capec.parquet [CAPEC_REF_URL]
# Verify downloads
ls -la *.parquet
Option 2: Download DuckDB Database¶
Database File Download
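Mirroring the placeholder convention of Option 1 (the actual hosting location is not specified in this guide):

```bash
# Download the pre-built DuckDB database
# (Replace the URL placeholder with the actual data source location)
wget -O cve_research.duckdb [DUCKDB_DATABASE_URL]

# Verify the download
ls -la cve_research.duckdb
```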
Data Verification¶
Data Quality Check
import duckdb
import os

# For parquet files
if os.path.exists('parquet_data/mysql_cve.parquet'):
    con = duckdb.connect(':memory:')
    con.sql("CREATE VIEW cve_test AS SELECT * FROM 'parquet_data/mysql_cve.parquet'")
    count = con.sql("SELECT COUNT(*) FROM cve_test").fetchone()[0]
    print(f"CVE records loaded: {count:,}")

# For database file
if os.path.exists('cve_research.duckdb'):
    con = duckdb.connect('cve_research.duckdb')
    tables = con.sql("SHOW TABLES").fetchall()
    print(f"Database tables: {[t[0] for t in tables]}")
🔬 Analysis Execution¶
Step 1: Launch Jupyter Environment¶
Jupyter Setup
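With the environment from the setup steps active, launching Jupyter is a single command; run it from the repository root so relative paths such as parquet_data/ and figures/ resolve correctly:

```bash
# Launch Jupyter Lab (or use `jupyter notebook` for the classic interface)
jupyter lab
```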
Step 2: Configure Data Source¶
Data Configuration
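The notebook's actual configuration cell is not reproduced here; the sketch below shows one way to switch between the two data sources. The `USE_PARQUET` flag is an illustrative choice, and the `cve_main` view name matches the one used in the validation snippets later in this guide:

```python
import duckdb

USE_PARQUET = True  # set to False to use the DuckDB database file instead

if USE_PARQUET:
    # Query the parquet files in place through an in-memory connection
    con = duckdb.connect(":memory:")
    con.sql("CREATE VIEW cve_main AS SELECT * FROM 'parquet_data/mysql_cve.parquet'")
else:
    # The pre-built database already contains the tables (names may differ)
    con = duckdb.connect("cve_research.duckdb")
```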
Step 3: Execute Analysis¶
Run Analysis Sections
Section Order:

1. Environment Setup: import libraries and configure settings
2. Data Loading: connect to your chosen data source
3. Data Quality Assessment: verify data integrity
4. Vendor CWE Analysis: run the core analysis functions
5. Statistical Analysis: correlation and pattern analysis
6. Visualization Generation: create publication-quality figures
7. Results Export: save results and figures
Step 4: Monitor Progress¶
Expected Processing Times
- Data Loading: 2-5 minutes
- CWE Analysis: 10-15 minutes
- Statistical Analysis: 5-10 minutes
- Visualization Generation: 15-20 minutes
- Total Runtime: 30-50 minutes (depending on system)
📈 Expected Outputs¶
Generated Visualizations¶
Figure Outputs
The analysis generates five publication-quality figures, plus a supplementary graphic explaining the statistical concepts:
figures/
├── Figure1_Vendor_CWE_Comprehensive_Analysis.png
├── Figure2_CWE_Vendor_Heatmap.png
├── Figure3_CWE_Category_Distribution.png
├── Figure4_Vendor_Specialization_Radar.png
├── Figure5_Statistical_Analysis.png
└── statistical_concepts_explanation.png
Formats: each figure is saved as both PNG (for presentations) and EPS (for publication)
Data Exports¶
CSV and JSON Outputs
results/
├── vendor_cwe_summary.csv # Summary statistics
├── microsoft_cwe_analysis.csv # Microsoft detailed results
├── cisco_cwe_analysis.csv # Cisco detailed results
├── redhat_commercial_analysis.csv # RedHat Commercial results
├── redhat_opensource_analysis.csv # RedHat Open-Source results
├── github_cwe_analysis.csv # GitHub detailed results
└── statistical_correlations.json # Correlation coefficients
Console Output Summary¶
Expected Terminal Output
=== VENDOR-SPECIFIC CWE ANALYSIS COMPLETE ===
Dataset Overview:
• Total unique CWEs analyzed: 733
• Number of vendors/platforms: 5
• Total CVEs across all vendors: 50,270
Coverage Statistics:
• Mean CWE coverage: 42.9% (SD: 12.7%)
• Highest coverage: 60.6% (GitHub Open-Source)
• Lowest coverage: 27.7% (RedHat Commercial)
Specialization Statistics:
• Mean specialization: 12.4% (SD: 4.0%)
• Most specialized: 16.8% (Cisco)
• Least specialized: 7.2% (RedHat Commercial)
Correlation Analysis:
• Coverage vs Specialization: r = 0.444
• Volume vs Diversity: r = 0.958
🔍 Validation Steps¶
Data Integrity Checks¶
Verification Checklist
✅ Data Loading Verification
# Check total CVE count
total_cves = con.sql("SELECT COUNT(*) FROM cve_main").fetchone()[0]
assert total_cves > 50000, f"Expected >50k CVEs, got {total_cves}"

# Check CWE diversity: split the comma-separated cwe_ids in a subquery,
# then count distinct values (UNNEST cannot sit inside an aggregate)
unique_cwes = con.sql("""
    SELECT COUNT(DISTINCT cwe) FROM (
        SELECT TRIM(UNNEST(string_split(cwe_ids, ','))) AS cwe
        FROM cve_main WHERE cwe_ids IS NOT NULL
    )
""").fetchone()[0]
assert unique_cwes > 700, f"Expected >700 CWEs, got {unique_cwes}"
✅ Vendor Data Verification
# Verify vendor record counts
vendor_counts = {
'Microsoft': con.sql("SELECT COUNT(*) FROM msrc_patches").fetchone()[0],
'Cisco': con.sql("SELECT COUNT(*) FROM cisco_patches").fetchone()[0],
'RedHat': con.sql("SELECT COUNT(*) FROM redhat_patches").fetchone()[0],
'GitHub': con.sql("SELECT COUNT(*) FROM github_advisories").fetchone()[0]
}
for vendor, count in vendor_counts.items():
    print(f"{vendor}: {count:,} records")
    assert count > 0, f"No data found for {vendor}"
Statistical Validation¶
Result Validation
# Validate key statistical findings
correlations = analysis_results['correlations']
# Coverage vs Specialization correlation
assert 0.4 < correlations['coverage_specialization'] < 0.5, \
"Coverage-Specialization correlation outside expected range"
# Volume vs Diversity correlation
assert correlations['volume_diversity'] > 0.9, \
"Volume-Diversity correlation lower than expected"
print("✅ All statistical validations passed")
Visual Output Validation¶
Figure Verification
import os
expected_figures = [
'Figure1_Vendor_CWE_Comprehensive_Analysis.png',
'Figure2_CWE_Vendor_Heatmap.png',
'Figure3_CWE_Category_Distribution.png',
'Figure4_Vendor_Specialization_Radar.png',
'Figure5_Statistical_Analysis.png'
]
for figure in expected_figures:
    filepath = f'figures/{figure}'
    assert os.path.exists(filepath), f"Missing figure: {figure}"

    # Check file size (should be substantial for high-quality figures)
    size_mb = os.path.getsize(filepath) / (1024 * 1024)
    assert size_mb > 0.1, f"Figure {figure} too small: {size_mb:.2f}MB"

print("✅ All figures generated successfully")
🐛 Troubleshooting¶
Common Issues and Solutions¶
Memory Issues
Problem: Out-of-memory errors during analysis
Solutions:

- Close other memory-intensive applications before running the notebook
- Push filtering and aggregation into DuckDB SQL rather than loading full tables into pandas (see the sketch below)
- Read only the columns you need from the parquet files
- Use a machine that meets the 16GB RAM recommendation for the full dataset
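For example, aggregation can be pushed into DuckDB so that only the small result set is materialized in Python; a sketch assuming the `con` connection and `cve_main` view from the data configuration step:

```python
# Aggregate inside DuckDB; only two numbers cross into Python
stats = con.sql("""
    SELECT
        COUNT(*) AS total_cves,
        SUM(CASE WHEN cwe_ids IS NULL THEN 1 ELSE 0 END) AS missing_cwe
    FROM cve_main
""").df()
print(stats)
```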
Data Loading Errors
Problem: Cannot load parquet files or the database
Solutions:

- Verify the expected layout: parquet files under parquet_data/, or cve_research.duckdb in the working directory
- Compare file sizes with ls -la to spot truncated downloads, and re-download any suspect file
- Confirm that the duckdb package is installed in the active environment
- Check file permissions if loading fails with access errors
Package Import Errors
Problem: Module-not-found errors
Solutions:

- Activate the virtual environment created during setup before launching Jupyter
- Reinstall the missing package with pip (see the dependency list above)
- Make sure the Jupyter kernel points at the environment where the packages were installed
Performance Optimization¶
Speed Improvements
For Large Datasets:

- Query parquet files directly with DuckDB instead of materializing them fully in pandas (see the sketch below)
- Select only the columns you need and filter rows in SQL before converting to DataFrames
- Cache intermediate results to disk so repeated runs skip expensive steps
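As an illustration, DuckDB can scan a parquet file directly with column pruning, so only the referenced column is ever read from disk (a sketch reusing the `cwe_ids` column from the earlier validation snippets):

```python
import duckdb

con = duckdb.connect(":memory:")
# Only the cwe_ids column is read; all other columns stay on disk
top_cwes = con.sql("""
    SELECT TRIM(cwe) AS cwe_id, COUNT(*) AS n
    FROM (
        SELECT UNNEST(string_split(cwe_ids, ',')) AS cwe
        FROM 'parquet_data/mysql_cve.parquet'
        WHERE cwe_ids IS NOT NULL
    )
    GROUP BY cwe_id
    ORDER BY n DESC
    LIMIT 10
""").df()
print(top_cwes)
```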
Getting Help¶
Support Resources
Technical Issues:

- Check the GitHub Issues for similar problems
- Review the DuckDB documentation for SQL query issues
- Consult the pandas documentation for data processing questions

Research Questions:

- Contact: Eid.Albedah@city.ac.uk
- Include system information and error logs
- Specify which data source option you are using

Community Support:

- Stack Overflow for technical programming questions
- Reddit r/MachineLearning for methodology discussions
- Academic conferences for research collaboration
🎯 Customization Options¶
Modifying the Analysis¶
Customization Possibilities
Add New Vendors:
# Add custom vendor analysis
def analyze_custom_vendor(con, vendor_table, cve_mapping):
    # Implement vendor-specific analysis logic here
    pass
Adjust Time Periods:
# Filter by date range
date_filter = "WHERE date_published >= '2020-01-01' AND date_published <= '2024-12-31'"
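As a usage sketch (assuming the `con` connection and `cve_main` view from earlier steps), the filter string can be embedded directly into a query:

```python
# Count CVEs published within the selected window
windowed = con.sql(f"SELECT COUNT(*) FROM cve_main {date_filter}").fetchone()[0]
print(f"CVEs published 2020-2024: {windowed:,}")
```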
Custom CWE Categories:
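The notebook's exact category scheme is not shown here, so the grouping below is only an illustration to adapt; the CWE IDs themselves are real:

```python
# Illustrative category map; adjust the groupings to your research questions
custom_cwe_categories = {
    "Memory Safety": ["CWE-119", "CWE-125", "CWE-787", "CWE-416"],
    "Injection": ["CWE-79", "CWE-89", "CWE-78"],
    "Access Control": ["CWE-284", "CWE-269", "CWE-862"],
}

def categorize(cwe_id):
    # Return the first matching category, or "Other" if uncategorized
    for category, ids in custom_cwe_categories.items():
        if cwe_id in ids:
            return category
    return "Other"
```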
Extending the Visualizations¶
Custom Visualizations
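As a starting point, here is a minimal matplotlib sketch that plots from the exported summary; the `vendor` and `coverage_pct` column names are assumptions, so inspect the actual CSV header first:

```python
import os

import matplotlib.pyplot as plt
import pandas as pd

# Load the summary produced by the Results Export step
summary = pd.read_csv("results/vendor_cwe_summary.csv")

# Column names are assumed; check summary.columns and adjust as needed
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(summary["vendor"], summary["coverage_pct"])
ax.set_xlabel("Vendor / platform")
ax.set_ylabel("CWE coverage (%)")
ax.set_title("Custom view: CWE coverage by vendor")
fig.tight_layout()

os.makedirs("figures", exist_ok=True)
fig.savefig("figures/custom_coverage_bar.png", dpi=300)
```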
📋 Reproduction Checklist¶
Final Checklist
Before considering reproduction complete, verify:
- Environment: Python 3.8+ with all required packages
- Data: Either parquet files or DuckDB database loaded successfully
- Analysis: All notebook cells executed without errors
- Outputs: Five publication-quality figures generated
- Results: CSV exports created with expected data
- Validation: Statistical results match expected ranges
- Documentation: Analysis log and results documented
Estimated Total Time: 2-4 hours
Expected Output Size: ~100MB (figures + results)
For questions or issues with reproduction, contact: Eid.Albedah@city.ac.uk
Last updated: August 2025