The syntax follows the scikit-learn style, with a .fit() function to train the algorithm and a .predict() function to predict the class of new values. You can find two examples in the Jupyter notebook included in the repository.

    from GCForest import *
    gcf = gcForest(**kwargs)
    gcf.fit(X_train, y_train)
    gcf.predict(X_test)

Notes

I wrote the code from scratch in two days and, even though I have tested it on several cases, I obviously cannot certify that it is 100% bug-free. Feel free to test it and send me your feedback about any improvement and/or modification!

Known Issues

Memory consumption when slicing data

There is now a short naive calculation illustrating the issue in the notebook. So far, the input data slicing is done in a single step to train the Random Forest for the Multi-Grain Scanning. The problem is that it might require a lot of memory, depending on the size of the data set and the number of slices requested, resulting in memory crashes (at least on my Intel Core 2 Duo). I have recently improved the memory usage when slicing the data (from version 0.1.4) but will keep looking at ways to optimize the code.

OOB score error

During the Random Forests training, the Out-Of-Bag (OOB) technique is used for the prediction probabilities. It was found that this technique can sometimes raise an error when one or several samples are used to train every tree, since such samples are never out-of-bag and so have no OOB prediction. A potential solution consists in using cross-validation instead of the OOB score, although it slows down the training. Anyway, simply increasing the number of trees and re-running the training (and crossing fingers) is often enough.
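
To make the two options in the OOB paragraph concrete, here is a minimal sketch that obtains per-sample class probabilities both ways with plain scikit-learn. It is not the code used inside gcForest: the digits data set, the variable names and the choice of 3 folds are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Illustrative data only (assumption): the scikit-learn handwritten digits.
X, y = load_digits(return_X_y=True)

# OOB route: each sample is scored only by the trees whose bootstrap
# did not contain it. With few trees, a sample can end up in every
# bootstrap and then has no OOB estimate at all.
rf_oob = RandomForestClassifier(n_estimators=101, oob_score=True, random_state=0)
rf_oob.fit(X, y)
oob_proba = rf_oob.oob_decision_function_  # shape (n_samples, n_classes)

# Cross-validation route: every sample is scored by a forest trained on
# the other folds, so an estimate always exists, at the cost of
# training one forest per fold.
rf_cv = RandomForestClassifier(n_estimators=101, random_state=0)
cv_proba = cross_val_predict(rf_cv, X, y, cv=3, method="predict_proba")

print(oob_proba.shape, cv_proba.shape)
```

The cross-validation route fits the forest once per fold, which is where the slowdown comes from. Raising n_estimators simply makes it less likely that a sample lands in every bootstrap.
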
Built With

PyCharm community edition
memory_profiler library

License

This project is licensed under the MIT License (see LICENSE for details).

Early Results

(will be updated as new results come out)

Scikit-learn handwritten digits classification: training time ~ 5 min, accuracy ~ 98%

Code excerpt:

    import itertools
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    __author__ = "Pierre-Yves Lablanche"
    __email__ = "[email protected]"
    __license__ = "MIT"
    __version__ = "0.1.3"
    __status__ = "Development"


    # noinspection PyUnboundLocalVariable
    class gcForest(object):

        def __init__(self, shape_1X=None, n_mgsRFtree=30, window=None, stride=1,
                     cascade_test_size=0.2, n_cascadeRF=2, n_cascadeRFtree=101,
                     cascade_layer=np.inf, min_samples_mgs=0.1,
                     min_samples_cascade=0.05, tolerance=0.0, n_jobs=1):
            """ gcForest Classifier.

About Sizes

The main technical issue in the current gcForest implementation is memory usage on the input data. A straightforward calculation can give you an idea of the number and size of the objects the algorithm will handle.

Consider a C-class problem with N samples of size [l,L]. The initial size is:

$$N \times l \times L$$

Slicing Step

If my window is of size [wl,wL] and the chosen strides are [sl,sL], then the number of slices per sample is:

$$n_{slices} = \left(\left\lfloor \frac{l - w_l}{s_l} \right\rfloor + 1\right) \left(\left\lfloor \frac{L - w_L}{s_L} \right\rfloor + 1\right)$$

Obviously the size of each slice is [wl,wL], hence the total size of the sliced data set is:

$$N \times n_{slices} \times w_l \times w_L$$

This is when the memory consumption reaches its peak.

Class Vector after Multi-Grain Scanning
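
Going back to the Slicing Step estimate, the naive calculation can be sketched in a few lines of Python. This is only an illustration: the helper name slicing_footprint is hypothetical and is not part of gcForest, the slice count assumes the usual sliding-window formula (floor((l - wl) / sl) + 1 along each axis), and 8 bytes per value assumes float64 storage.

```python
def slicing_footprint(n_samples, sample_shape, window, stride, bytes_per_value=8):
    """Naive size estimate for a sliced data set (hypothetical helper).

    sample_shape = (l, L), window = (wl, wL), stride = (sl, sL).
    Returns (slices per sample, total number of values, total bytes).
    """
    l, L = sample_shape
    wl, wL = window
    sl, sL = stride
    if wl > l or wL > L:
        raise ValueError("window must fit inside the sample")
    # Number of window positions along each axis, multiplied together.
    n_slices = ((l - wl) // sl + 1) * ((L - wL) // sL + 1)
    # Every slice holds wl * wL values, for every sample.
    n_values = n_samples * n_slices * wl * wL
    return n_slices, n_values, n_values * bytes_per_value


# Example: 1797 samples of size 8x8 (the shape of the scikit-learn digits),
# with an assumed 4x4 window and a stride of 1 in both directions.
n_slices, n_values, n_bytes = slicing_footprint(1797, (8, 8), (4, 4), (1, 1))
print(n_slices, n_values, n_bytes)  # 25 718800 5750400
```

For this small example the sliced set already holds about six times as many values as the original 1797 x 8 x 8 data, which shows how quickly the peak grows with larger samples and smaller strides.
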