Treffer: Automatically Generating Rules of Malicious Software Packages via Large Language Model

Title:
Automatically Generating Rules of Malicious Software Packages via Large Language Model
Contributors:
Zhang, Xiangrui, Li, Qiang
Publisher Information:
Zenodo
Publication Year:
2025
Collection:
Zenodo
Subject Terms:
Document Type:
E-Ressource software
Language:
English
DOI:
10.5281/zenodo.15278659
Rights:
Accession Number:
edsbas.3D9EC554
Database:
BASE

Weitere Informationen

Malware Detection Rule Generator A Python-based tool that automatically generates, refines, and validates malware detection rules from code samples. The system supports both YARA and Semgrep rule generation, using advanced language models to analyze malicious code patterns. ## Features - **Automated Rule Generation**: Analyzes malicious code samples to generate detection rules - **Multiple Rule Types**: Supports both YARA and Semgrep rule formats - **Intelligent Clustering**: Groups similar malware samples for more effective rule generation - **Rule Refinement**: Improves initial rules through pattern analysis and optimization - **Validation System**: Automatically validates and fixes generated rules - **Progress Tracking**: Maintains detailed logs of the generation process ## Installation 1. Clone the repository: ```bash git clone <repository-url> cd malware-rule-generator ``` 2. Install required dependencies: ```bash pip install -r requirements.txt ``` Required dependencies include: - openai - yara-python - semgrep - torch - transformers - scikit-learn - numpy - tqdm ## Configuration Create a `config.ini` file in the project root: ```ini [Settings] Model = # e.g., gpt-4-0125-preview ModelApiKey = <your-api-key> # Your API key BaseURL = # API base URL, such as https://api.openai.com/v1 ``` ## Workflow The system employs a three-phase process to generate effective malware detection rules: 1. **Clustering & Preprocessing**: - Uses `cluster_malware.py` to analyze and cluster similar malware samples - Leverages CodeBERT to generate embeddings for code samples - Applies K-means clustering to group similar malware - Filters large and heavily encoded samples to improve processing efficiency 2. **Rule Generation Process**: - Selects samples from each cluster - `Generator` analyzes samples and creates initial rules - `Refiner` optimizes rules to improve detection efficiency and reduce false positives - `Fixer` validates rule format and fixes any syntax errors 3. **Output & Logging**: - Generates ...