Skip to content

Data Dictionary and Schema Overview

Dataset Overview

This documentation describes the schema, sources, and column definitions for all tables in the unified vulnerability database. The goal is to support data analysts in exploring and interpreting complex security datasets. All tables are loaded into a DuckDB database and sourced from trusted industry feeds.

Database Scale & Statistics

Metric Value
Uncompressed Size 24 GB
Compressed Size 5 GB
Primary Engine DuckDB (Analytical)
Format Columnar Parquet
pie title CVE-Product/Package Pairs Distribution
    "Red Hat (9M)" : 9000000
    "Cisco (385K)" : 385000
    "GitHub (297K)" : 297000
    "Microsoft (185K)" : 185000
    "MoreFixes (35K)" : 35276

Trusted Industry Feeds

  • MITRE CVE-V5 & NVD - Primary vulnerability data
  • CISA KEV & SSVC - Government threat intelligence
  • ExploitDB - Community exploit repository
  • Microsoft MSRC - Vendor security advisories
  • Red Hat Security - Enterprise Linux advisories
  • Cisco PSIRT - Network infrastructure advisories
  • GitHub Security - Open source package advisories
  • MoreFixes Academic - Research-grade fix commits

Table Overview & Relationships

graph TB
    subgraph "Core Vulnerability Data"
        CVE[cve_main<br/>Primary CVE Repository]
        EXPLOITS[exploits<br/>ExploitDB Public Exploits]
    end

    subgraph "Commercial Vendor Patches"
        MSRC[msrc_patches<br/>Microsoft: 185K pairs]
        CISCO[cisco_patches<br/>Cisco: 385K pairs]
        REDHAT[redhat_patches<br/>Red Hat: 9M pairs]
    end

    subgraph "Open Source Intelligence"
        GITHUB[github_advisories<br/>GitHub: 297K pairs]
        MOREFIXES[morefixes_*<br/>Academic: 35K commits]
    end

    subgraph "Reference Data"
        CWE[cwe_ref<br/>MITRE CWE Catalog]
        CAPEC[capec_ref<br/>MITRE CAPEC Patterns]
    end

    CVE --> EXPLOITS
    CVE --> MSRC
    CVE --> CISCO
    CVE --> REDHAT
    CVE --> GITHUB
    CVE --> MOREFIXES
    CVE --> CWE
    CVE --> CAPEC

    style CVE fill:#e1f5fe
    style EXPLOITS fill:#fff3e0
    style MSRC fill:#e8f5e8
    style CISCO fill:#e8f5e8
    style REDHAT fill:#e8f5e8
    style GITHUB fill:#f3e5f5
    style MOREFIXES fill:#f3e5f5

Core Tables

cve_main - Primary CVE Repository

Central Vulnerability Database

Sources: MITRE CVE-V5, NVD, CISA KEV, CISA SSVC, ExploitDB
Primary Key: cve_id
Record Type: Unique CVE entries

Column Type Description
id Integer Internal unique identifier
cve_id String CVE ID from MITRE/NVD (e.g., CVE-2023-12345)
assigner_org String Organization that assigned the CVE
state String CVE state: PUBLISHED or REJECTED
description Text CVE description text
Column Type Description
date_reserved Date When the CVE ID was reserved
date_published Date When the CVE was published
date_updated Date Last update timestamp

Scoring Priority: v4 → v3 → v2

Column Type Description
cvss_v2_score Float CVSS v2 score (0.0–10.0)
cvss_v2_vector String Vector string for CVSS v2
cvss_v3_score Float CVSS v3 score
cvss_v3_vector String CVSS v3 vector
cvss_v3_severity String Severity level (LOW, MEDIUM, HIGH, CRITICAL)
cvss_v4_score Float CVSS v4 score
cvss_v4_vector String CVSS v4 vector string
cvss_v4_severity String CVSS v4 severity classification
Column Type Description
cwe_ids String Comma-separated list of CWE IDs
cpes String Comma-separated CPEs affected
vendors String Vendors (can be imprecise)
products String Products (can be imprecise)
references Text List of external references (URLs)

Government Threat Assessment

Column Type Description
ssvc_exploitation String SSVC status: "active", "poc", or "none"
ssvc_automatable Boolean Is the issue automatable?
ssvc_technical_impact String Level of technical impact
kev_known_exploited Boolean 1 if listed in CISA KEV
kev_vendor_project String Vendor/project name from KEV
kev_product String Product name from KEV
kev_vulnerability_name String Title from CISA KEV advisory
kev_date_added Date Date added to KEV list
kev_short_description Text Short description from KEV
kev_required_action String Recommended mitigation action
kev_due_date Date Deadline for remediation
kev_ransomware_use Boolean Indicates ransomware involvement
kev_notes Text Any notes from KEV advisory
kev_cwes String Associated CWE(s) in KEV
Column Type Description
epss_score Float EPSS likelihood score (0-1)
epss_percentile Float EPSS global percentile
has_exploit Boolean 1 if exploit exists in ExploitDB
exploit_count Integer Number of exploits in ExploitDB for this CVE
Column Type Description
data_sources String Comma-separated source indicators
created_at Timestamp DB timestamp (ignore)
updated_at Timestamp DB timestamp (ignore)
first_exploit_date Date Earliest exploit date (ignore)
latest_exploit_date Date Latest exploit date (ignore)

exploits - Public Exploit Repository

Threat Intelligence Data

Source: ExploitDB
Primary Key: id
Foreign Key: cve_idcve_main.cve_id
Record Type: Individual exploit entries

Column Type Description
id Integer Internal ID
file String File name or path to exploit code
description Text Description of the exploit
date_published Date Date of publication in ExploitDB
author String Author of the exploit
type String Exploit type (e.g., local, remote)
platform String Affected platform
port Integer Port used by exploit if applicable
date_added Date When added to DB
date_updated Date Last update to record
verified Boolean 1 if exploit was verified
codes Text Exploit code or snippets
tags String Tags (e.g., webapps, dos)
aliases String Alternative exploit names or IDs
screenshot_url String Screenshot of exploit demo if available
application_url String URL of affected application
source_url String Source reference
cve_id String Related CVE ID

Commercial Vendor Patches

msrc_patches - Microsoft Security Response Center

Microsoft Vendor Intelligence

Source: Microsoft MSRC (CVRF/CSAF)
Record Type: CVE-Product pairs
Volume: 185,000 pairs
Note: Single advisory may appear multiple times

Column Type Description
title String Title of the advisory
release_date Date Official patch release date
initial_release_date Date Initial release of the advisory
cvrf_id String Microsoft advisory ID
cve_id String CVE associated
Column Type Description
exploited_status Boolean 1 if exploited
exploitation_potential_lsr String Exploitation potential - Latest security release
exploitation_potential_osr String Exploitation potential - Other supported releases
publicly_disclosed Boolean 1 if publicly disclosed
Column Type Description
cvss_score Float CVSS score
cvss_vector String Vector string
vuln_title String Title of the vulnerability
cwe_ids String Comma-separated CWE list
Column Type Description
product_id String Internal product ID
product_name String Human-readable product name
product_branch String Specific version or release branch
product_cpe String CPE name for affected software
Column Type Description
threats Text Known threats or risks
remediations Text Mitigations or patch steps
notes Text Notes from Microsoft
acknowledgments Text Credit for discovery

redhat_patches - Red Hat Security Advisories

Enterprise Linux Intelligence

Source: Red Hat Security Advisory (CSAF)
Record Type: CVE-Product pairs
Volume: 9 million pairs
Coverage: Official Red Hat + Open Source + Third-party

Product Filtering Required

To filter for official Red Hat products, use these keywords in product_name or product_id:

rh, red hat, red-hat, rhel, enterprise linux, baseos, appstream, openshift

Column Type Description
id Integer Internal ID
advisory_id String Red Hat Advisory ID
title String Title of advisory
cve_id String CVE associated
cwe_id String CWE ID
vulnerability_title String Vulnerability title
Column Type Description
current_release_date Date Current release date
initial_release_date Date First release date
discovery_date Date When discovered
release_date Date Full release timeline
Column Type Description
status String Advisory status
version String Package version patched
publisher String "Red Hat" or other originator
publisher_category String Type of publisher (vendor, third-party, etc.)
Column Type Description
summary Text Summary of advisory
details Text Detailed description
cvss_score Float CVSS score
cvss_severity String Severity rating (LOW–CRITICAL)
cvss_vector String Vector string
threat_impact String Description of impact
aggregate_severity String Aggregated severity level
Column Type Description
product_id String Product ID (e.g. "3AS:openmotif-debuginfo-0:2.2.3")
product_name String Product name (e.g. "Red Hat Linux 7.1")

cisco_patches - Cisco Product Security

Network Infrastructure Intelligence

Source: Cisco PSIRT (CSAF)
Record Type: CVE-Product pairs
Volume: 385,000 pairs

Column Type Description
advisory_id String Cisco Advisory ID
title String Title of advisory
cve_id String CVE associated
vulnerability_title String Vulnerability title
Column Type Description
current_release_date Date Latest version date
initial_release_date Date First published
vulnerability_release_date Date Actual vulnerability disclosure date
Column Type Description
status String Advisory status
version String Advisory version
publisher String Cisco or partner
publisher_category String Type of publisher
Column Type Description
summary Text Summary text
details Text Detailed description
cvss_score Float CVSS score
cvss_severity String Severity label
cvss_vector String Vector string
bug_ids String Related bug trackers
Column Type Description
product_id String Cisco product ID
product_name String Human-readable product name
product_full_path String Full internal path to product
acknowledgments Text Credits
references Text External URLs
remediations Text Steps to fix

Open Source Intelligence

github_advisories - GitHub Security Database

Package Ecosystem Intelligence

Source: GitHub Advisory Database
Record Type: CVE-Package pairs
Volume: 297,000 pairs

Enhanced with Inferred Fields

Fields marked (inferred) are derived using keyword scanning and are not native to GitHub:

  • exploited - Based on terms like "actively exploited", "attacks observed"
  • poc_available - Based on PoC keywords or exploit database references
  • exploitability_level - 0–3 scale based on difficulty terms
  • patched/patch_available - Extracted from JSON affected.ranges.events
Column Type Description
id Integer Internal ID
ghsa_id String GitHub Security Advisory ID (GHSA-...)
schema_version String Version of advisory schema
published Timestamp Publish timestamp
modified Timestamp Last modification timestamp
Column Type Description
summary Text One-line summary
details Text Full description
primary_cve String Main CVE ID
all_cves String All related CVEs
references Text List of references (URLs)
Column Type Description
cvss_v3_score Float CVSS v3 score
cvss_v3_vector String CVSS v3 vector
cvss_v4_score Float CVSS v4 score
cvss_v4_vector String CVSS v4 vector
database_severity String Severity category (GitHub-defined)
severity_score Float Internal numeric score
cwe_ids String CWE IDs
Column Type Description
github_reviewed Boolean Boolean if reviewed by GitHub
github_reviewed_at Timestamp Timestamp of review
nvd_published_at Timestamp When added to NVD

ETL-Enhanced Fields

Column Type Description
exploited Boolean 1 if exploitation confirmed (inferred)
exploitability_level Integer 0–3 scale based on inferred difficulty
poc_available Boolean 1 if PoC or exploit publicly available (inferred)
patched Boolean 1 if patched (inferred)
patch_available Boolean 1 if patch reference is present (inferred)
Column Type Description
primary_ecosystem String Primary ecosystem (e.g., npm, pip)
all_ecosystems String All affected ecosystems
package_ecosystem String Package ecosystem (npm, pip, etc.)
package_name String Name of affected package
package_purl String Package URL identifier (PURL)
affected_ranges String Version range strings affected
affected_versions String Specific affected versions

Academic Research Data

MoreFixes Dataset Family

Academic Research Dataset

Source: MoreFixes Dataset (JafarAkhondali et al., 2024)
Paper: ACM Digital Library
Repository: GitHub
Scale: 29,203 unique CVEs from 7,238 GitHub projects

Dataset Characteristics

This dataset contains the largest collection of CVE vulnerability data with fix commits available today:

  • 35,276 unique commits linked to vulnerabilities
  • 39,931 patch commit files that fixed those vulnerabilities
  • Some patch files couldn't be saved as SQL due to technical constraints
graph TB
    subgraph "MoreFixes Schema"
        CVE_MF[morefixes_cve<br/>CVE Metadata]
        FIXES[morefixes_fixes<br/>CVE-Commit Links]
        COMMITS[morefixes_commits<br/>Commit Details]
        REPOS[morefixes_repository<br/>Repository Info]
        FILES[morefixes_file_change<br/>File-level Changes]
        METHODS[morefixes_method_change<br/>Method-level Changes]
        CWE_CLASS[morefixes_cwe_classification<br/>CVE-CWE Mapping]
        CWE_MF[morefixes_cwe<br/>CWE Reference]
    end

    CVE_MF --> FIXES
    FIXES --> COMMITS
    COMMITS --> REPOS
    COMMITS --> FILES
    FILES --> METHODS
    CVE_MF --> CWE_CLASS
    CWE_CLASS --> CWE_MF

    style CVE_MF fill:#e3f2fd
    style FIXES fill:#f3e5f5
    style COMMITS fill:#e8f5e8
    style FILES fill:#fff3e0

CVE Metadata Table

Column Type Description
cve_id String CVE ID
published_date Date CVE publication date
last_modified_date Date Last modified date
description Text CVE summary
nodes String Affected OS/software versions
severity String Severity level
obtain_all_privilege Boolean Boolean flag
obtain_user_privilege Boolean Boolean flag
obtain_other_privilege Boolean Boolean flag
user_interaction_required Boolean Boolean flag
cvss2_vector_string String CVSSv2 vector string
cvss3_vector_string String CVSSv3 vector string

CVE-Commit Relationship Table

Column Type Description
cve_id String CVE ID
hash String Commit hash
repo_url String Git repository URL
rel_type String Commit relation type
score Integer Score based on relevance (e.g., 1337)

Commit Details Table

Column Type Description
hash String Commit hash
repo_url String Repository URL
author String Author name
msg Text Commit message
num_lines_added Integer Lines added
num_lines_deleted Integer Lines deleted
author_date Timestamp Timestamp

Repository Metadata Table

Column Type Description
repo_url String Repository URL
repo_name String Name of repo
description Text Short description
date_created Date When repo created
date_last_push Date When last pushed
owner String Owner username

File-Level Changes Table

Column Type Description
file_change_id String Unique ID
hash String Commit hash
filename String File name
change_type String ADD/DELETE/MODIFY
diff Text Git diff
code_before Text Code snippet before fix
code_after Text Code snippet after fix

Method-Level Changes Table

Column Type Description
method_change_id String Unique ID
file_change_id String FK to file change
name String Function/method name
code Text Full method code
complexity Integer Cyclomatic complexity

CVE-CWE Mapping Table

Column Type Description
cve_id String CVE ID
cwe_id String CWE ID

CWE Reference Table

Column Type Description
cwe_id String CWE ID
cwe_name String Name of weakness
description Text Short description
is_category Boolean Boolean

Reference Tables

cwe_ref - Common Weakness Enumeration

MITRE CWE Knowledge Base

Source: MITRE CWE
Primary Key: cwe_id
Usage: Weakness classification and pattern analysis

Column Type Description
cwe_id String Unique identifier for the CWE entry (e.g., CWE-79)
name String Short title of the weakness (e.g., Cross-site Scripting)
weakness_abstraction String Generalization level: Base, Variant, Class, etc.
status String Current status: Draft, Incomplete, Deprecated, etc.
description Text Brief description of the weakness
extended_description Text Full textual explanation including implications, context, examples
Column Type Description
related_weaknesses String List of associated or parent/child CWE IDs
weakness_ordinalities String Ordering of weakness by nature or relevance
applicable_platforms String Technology platforms (e.g., Web, IoT, Mobile)
background_details Text Additional background for understanding
alternate_terms String Other names or aliases for this weakness
Column Type Description
modes_of_introduction Text How the weakness typically arises in development
exploitation_factors Text Factors affecting how this weakness can be exploited
likelihood_of_exploit String Qualitative probability of exploitation (e.g., High, Medium)
common_consequences Text Typical impacts such as DoS, Data Disclosure
Column Type Description
detection_methods Text How the weakness is typically detected (e.g., SAST)
potential_mitigations Text Recommended mitigation techniques
observed_examples Text CVE examples linked to this weakness
Column Type Description
functional_areas String Functional areas affected, such as Authentication
affected_resources String System elements affected (e.g., Database, Web Layer)
taxonomy_mappings Text Linked taxonomies such as OWASP Top 10
related_attack_patterns String CAPEC IDs related to the weakness
notes Text Editorial or historical notes
created_at Timestamp Timestamp of record creation

Example CWE Entry

cwe_id name weakness_abstraction status description
CWE-79 Improper Neutralization of Input Variant Complete The software does not neutralize or incorrectly neutralizes user-controllable input before it is placed in output that is used as a web page that is served to other users.

capec_ref - Common Attack Pattern Enumeration

MITRE CAPEC Attack Patterns

Source: MITRE CAPEC
Primary Key: capec_id
Usage: Attack methodology analysis and CWE correlation

Column Type Description
capec_id String Unique CAPEC identifier (e.g., CAPEC-1)
name String Name of the attack pattern (e.g., Accessing Functionality Not Properly Constrained by ACLs)
abstraction String Abstraction level (Standard, Meta, etc.)
status String Entry status (e.g., Draft, Complete)
description Text Brief description of the attack pattern
alternate_terms String Other terms used for the same pattern
Column Type Description
likelihood_of_attack String Qualitative assessment (e.g., High, Medium)
typical_severity String Expected severity if the attack is successful (e.g., High)
related_attack_patterns Text List of CAPEC relationships (e.g., ChildOf, CanPrecede)
Column Type Description
execution_flow Text Sequence of actions for attack execution
prerequisites Text Conditions required for the attack to succeed
skills_required Text Skill level necessary for an attacker
resources_required Text Tools or resources an attacker needs
Column Type Description
indicators Text Observable signs of this attack pattern
consequences Text Potential impacts on confidentiality, integrity, availability
mitigations Text Recommendations to prevent or mitigate the attack
Column Type Description
example_instances Text Known real-world examples of this pattern
related_weaknesses String CWE IDs related to this CAPEC pattern
taxonomy_mappings Text External taxonomy associations (e.g., ATT&CK)
notes Text Editorial, contextual, or historical notes
created_at Timestamp Timestamp of record creation

Example CAPEC Entry

capec_id name abstraction status description
CAPEC-1 Accessing Functionality Not Properly Constrained by ACLs Standard Draft Attackers access web functionality not protected by ACLs due to misconfiguration or missing access control. This can lead to privilege escalation or unauthorized actions.

Data Quality & Processing Notes

Important Data Considerations

Critical Data Quality Notes

CVSS and EPSS Handling

  • If any cvss_*_score or epss_score/epss_percentile value is -1, this indicates the score was not available from the source
  • Use priority ordering: CVSS v4 → v3 → v2 for analysis
  • EPSS scores range from 0-1 (probability of exploitation)

Exploitation Timeline Constraints

  • Primary temporal data: Only found in exploits table (date_published from ExploitDB)
  • Other datasets: Lack explicit timestamps for when exploitation was observed
  • kev_date_added: Reflects when CISA became aware of exploit, not when exploitation began
  • Analysis impact: Limited ability to create precise exploitation timelines

Red Hat Product Classification

To ensure only official Red Hat products are selected from redhat_patches, filter using these keywords in product_name or product_id:

WHERE (
    LOWER(product_name) REGEXP '(rh|red.hat|red-hat|rhel|enterprise.linux|baseos|appstream|openshift)'
    OR LOWER(product_id) REGEXP '(rh|red.hat|red-hat|rhel|enterprise.linux|baseos|appstream|openshift)'
)

GitHub Advisory Inferred Fields

The following fields in github_advisories are not native to GitHub but derived via ETL keyword detection:

Field Detection Method Confidence Level
exploited Keywords: "actively exploited", "attacks observed", "exploitation detected" Medium
poc_available References to ExploitDB, PoC repositories, demonstration code High
exploitability_level Terms: "trivial", "complex", difficulty indicators (0-3 scale) Low-Medium
patched JSON parsing of affected.ranges.events for "fixed" events High
patch_available Presence of patch references, fix commits, or remediation links High

Record Duplication Patterns

Expected Duplication Scenarios

CVE-Product/Package Pairs

All vendor patch tables contain intentional duplication where a single CVE affects multiple products:

graph LR
    CVE[CVE-2023-12345] --> P1[Product A v1.0]
    CVE --> P2[Product A v2.0]
    CVE --> P3[Product B v1.5]
    CVE --> P4[Product C v3.2]

    style CVE fill:#ffebee
    style P1 fill:#e8f5e8
    style P2 fill:#e8f5e8
    style P3 fill:#e8f5e8
    style P4 fill:#e8f5e8

Examples of legitimate duplication: - Microsoft: Single advisory covering multiple Windows versions - Red Hat: CVE affecting multiple RHEL releases and packages - Cisco: Network vulnerability across multiple router models - GitHub: Package vulnerability affecting multiple ecosystem versions

Dataset Comparison Guidelines

Ecosystem Context Considerations

Different Data Contexts - Do Not Directly Compare

Commercial Vendor Patches (msrc_patches, cisco_patches, filtered redhat_patches): - Vendor-provided security advisories - Formal patch release processes - Coordinated disclosure timelines - Commercial support obligations

Open Source Intelligence (github_advisories, morefixes_*, unfiltered redhat_patches): - Community-driven fix processes - Package manager ecosystems (npm, PyPI, Maven) - Variable disclosure coordination - Diverse fix quality and timing

Appropriate Comparison Strategies

Valid Comparisons: - Commercial vendors against each other (Microsoft vs Cisco vs Red Hat official) - Open source ecosystems against each other (npm vs PyPI packages) - Temporal trends within the same ecosystem - Severity distributions within vendor categories

Invalid Comparisons: - Direct commercial vs open source response times - Package fix times vs enterprise patch cycles - Mixed ecosystem aggregated metrics - Cross-ecosystem severity correlations without context

Query Performance & Optimization

Database Query Guidelines

Optimized for Analytical Workloads

-- Efficient temporal analysis with proper indexing
SELECT 
    EXTRACT(YEAR FROM date_published) as year,
    COUNT(*) as cve_count,
    AVG(cvss_v3_score) as avg_severity
FROM cve_main 
WHERE date_published >= '2020-01-01'
    AND cvss_v3_score > 0
GROUP BY 1
ORDER BY 1;

-- Multi-vendor patch comparison with proper filtering
WITH vendor_patches AS (
    SELECT cve_id, release_date, 'Microsoft' as vendor 
    FROM msrc_patches
    UNION ALL
    SELECT cve_id, current_release_date, 'Cisco' 
    FROM cisco_patches
    UNION ALL
    SELECT cve_id, current_release_date, 'RedHat'
    FROM redhat_patches 
    WHERE LOWER(product_name) REGEXP 'rhel|enterprise.linux'
)
SELECT vendor, COUNT(*) as patch_count
FROM vendor_patches
GROUP BY vendor;

DuckDB Optimization Tips

  • Columnar Advantage: Aggregate queries perform exceptionally well
  • Date Filtering: Always use date ranges to limit scan scope
  • String Operations: REGEXP operations are optimized but still costly
  • JOIN Performance: Foreign key relationships are well-optimized
  • Memory Usage: Large text fields (descriptions, details) impact memory

Research Applications & Use Cases

mindmap
  root((Data Applications))
    Academic Research
      Vulnerability Lifecycle
        Time-to-Exploit Analysis
        Patch Response Metrics
        Cross-Vendor Comparison
      Economic Analysis
        Resource Allocation
        Cost-Benefit Models
        Risk Assessment
      Predictive Modeling
        ML Feature Engineering
        Temporal Validation
        Cross-Ecosystem Training
    Industry Intelligence
      Threat Assessment
        Exploitation Prediction
        Priority Ranking
        Risk Scoring
      Patch Management
        Response Planning
        Resource Optimization
        Vendor Evaluation
      Security Operations
        Threat Hunting
        Incident Response
        Vulnerability Triage
    Policy Research
      Disclosure Coordination
        Vendor Response Analysis
        Timeline Optimization
        Best Practice Development
      Regulatory Impact
        Compliance Metrics
        Policy Effectiveness
        Industry Standards

Data Validation & Quality Assurance

Quality Framework

ETL Quality Checks

  • Schema Validation: Data type and constraint verification
  • Referential Integrity: Foreign key relationship validation
  • Temporal Consistency: Date sequence and range validation
  • Duplicate Detection: Cross-source duplicate identification
  • Completeness Checks: Required field population validation

Human Quality Assurance

  • Sample Validation: Random record verification against sources
  • Edge Case Analysis: Unusual pattern investigation
  • Source Reconciliation: Cross-reference with original sources
  • Domain Expert Review: Security professional validation
  • Update Verification: Change tracking and validation

Acknowledged Data Constraints

Limitation Impact Mitigation Strategy
Temporal Gaps Limited exploitation timeline precision Use available data with clear disclaimers
Inferred Fields Potential false positives in GitHub data Validate with additional sources
Vendor Variations Inconsistent advisory formats Normalize during analysis
Scale Differences Varying data volumes across sources Use proportional analysis methods
Update Latency Daily update cycles Account for freshness in analysis

Data Dictionary Summary

This comprehensive data dictionary provides the foundation for advanced vulnerability research across commercial and open source ecosystems. The unified schema enables sophisticated analysis while maintaining awareness of data quality considerations and appropriate usage patterns.

Key Takeaways:

  • Scale: 24GB uncompressed, 5GB compressed with millions of CVE-product pairs
  • Sources: Trusted industry feeds with proper attribution and validation
  • Quality: Rigorous ETL processes with automated and manual validation
  • Flexibility: Supports both academic research and practical security applications
  • Limitations: Clear documentation of constraints and appropriate usage