Editor’s Note: Elastic joined forces with Endgame in October 2019, and has migrated some of the Endgame blog content. See Elastic Security to learn more about our integrated security solutions.

Over the last decade, machine learning has achieved truly impressive results in fields such as optical character recognition, image labeling, and speech recognition. Advancements in hardware and rapidly growing datasets have been instrumental in this progress, as has the presence of public, open-source benchmark datasets to track advancements in the field. Although there is no shortage of data in security, many applications of machine learning in the security industry lack similar benchmark datasets because of the presence of personally identifiable information, sensitive network infrastructure information, or private intellectual property.

Today, Endgame is releasing ember to address this lack of open-source datasets in the domain of static malware detection. Ember (Endgame Malware BEnchmark for Research) is an open source collection of 1.1 million portable executable file (PE file) sha256 hashes that were scanned by VirusTotal sometime in 2017. The dataset includes metadata, derived features from the PE files, and a benchmark model trained on those features. Importantly, ember does NOT include the files themselves so that we can avoid releasing others’ intellectual property. With this dataset, researchers can now quantify the effectiveness of new machine learning techniques against a well-defined and openly available benchmark.

The 1.1 million samples include 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign). Each sample includes the sha256 hash of the file, the month the file was first seen, a label, and features derived from the file. The date histogram in Figure 1 compares the training data to the test data.

Including the date with each file and structuring the train/test split this way is important because of the evolving and adversarial nature of the static malware detection problem. Defenders must train a model at some point in time with all available information, but the goal is to best identify benign and malicious files that have not been seen yet. On top of that challenge, attackers are actively searching for samples that fool your model.

Figure 1: Distribution of dates that ember files were first seen

Along with the data, we are releasing a repository on GitHub that makes it very easy to work with the data. The ember repository defines the software environment that the benchmark model was trained in and allows anyone to reproducibly train it themselves. A Jupyter notebook is also provided that generates graphics and performance information related to the benchmark model. The process for calculating the derived features from the PE files is explicitly defined in code. With this, anyone can download the benchmark model and then use the repository to classify new PE files.

The benchmark ember model is a gradient boosted decision tree (GBDT) trained with LightGBM with default model parameters. The performance of this model on the test set is shown in Figure 2. The area under the ROC curve is a good method for comparing binary classifiers, and the ember benchmark model achieves a score of 0.9991123 on the test set. There are many easy ways to improve this score with the same GBDT algorithm, including optimizing model parameters, running feature selection, or engineering better features. As feature-less deep learning techniques work to match the performance of GBDTs in the domain of static malware classification, ember can provide a benchmark that measures their progress towards that goal.

Figure 2: Ember model scores on the test set files

Despite this ease of use, we have to advise against using the ember model as your antivirus engine. This is a research model and not a production model like Endgame MalwareScore®. The ember model is not optimized, it is not constantly updated with new data, and it performs worse than most production systems available. The purpose of the ember model is to provide comparison performance data and a jumping-off point for future research.
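To make the date-based train/test split and the ROC AUC comparison described in this post concrete, here is a minimal sketch in plain Python. It is not code from the ember repository. The record layout (`sha256`, `first_seen`, `label`), the cutoff month, and the scores are all assumptions made up for illustration, and the AUC is computed with the simple pairwise definition rather than with any particular library.

```python
from typing import Dict, List, Sequence, Tuple, Union

# Hypothetical record layout: field names are assumptions, not ember's schema.
Sample = Dict[str, Union[str, int]]


def temporal_split(
    samples: Sequence[Sample], cutoff_month: str
) -> Tuple[List[Sample], List[Sample]]:
    """Train on files first seen before cutoff_month, test on the rest.

    Months are "YYYY-MM" strings, so plain string comparison orders them.
    """
    train = [s for s in samples if str(s["first_seen"]) < cutoff_month]
    test = [s for s in samples if str(s["first_seen"]) >= cutoff_month]
    return train, test


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via its pairwise definition.

    Equals the fraction of (malicious, benign) pairs in which the malicious
    sample gets the higher score; ties count as half. Labels other than
    1 (malicious) and 0 (benign) are ignored. O(n*m), fine for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malicious and one benign sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
samples: List[Sample] = [
    {"sha256": "a" * 64, "first_seen": "2017-03", "label": 1},
    {"sha256": "b" * 64, "first_seen": "2017-05", "label": 0},
    {"sha256": "c" * 64, "first_seen": "2017-11", "label": 1},
    {"sha256": "d" * 64, "first_seen": "2017-12", "label": 0},
    {"sha256": "e" * 64, "first_seen": "2017-11", "label": 1},
    {"sha256": "f" * 64, "first_seen": "2017-12", "label": 0},
]

train, test = temporal_split(samples, "2017-11")  # cutoff is an assumption
test_labels = [int(s["label"]) for s in test]
hypothetical_scores = [0.9, 0.2, 0.4, 0.6]  # stand-in for model outputs
auc = roc_auc(test_labels, hypothetical_scores)
print(len(train), len(test), auc)  # prints: 2 4 0.75
```

Splitting by first-seen date rather than at random keeps the evaluation honest about the real task: the model is scored only on files that appeared after everything it was trained on.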