gcForest (multi-Grained Cascade Forest): Professor Zhihua Zhou's algorithm applied to predicting stock index futures (Part 2)

Date: 2017-04-29 05:17
gcForest is an algorithm proposed by Zhou and Feng (2017). It uses a multi-grain scanning approach for data slicing and a cascade structure of multiple random forest layers (see the paper for details).

  gcForest was first developed as a classifier and is designed so that the multi-grain scanning module and the cascade structure can be used separately. During development I paid special attention to writing the code in such a way that future parallelization should be straightforward to implement.

  Prerequisites

  The present code has been developed under Python 3.x. You will need the following installed on your computer to make it work:

Python 3.x

Numpy >= 1.12.0

Scikit-learn >= 0.18.1

jupyter >= 1.0.0 (only needed to run the tutorial notebook)

  You can install all of them using pip:

```sh
$ pip3 install -r requirements.txt
```

  Using gcForest

  The syntax follows the scikit-learn style, with a .fit() function to train the algorithm and a .predict() function to predict the class of new samples. You can find two examples in the jupyter notebook included in the repository.

```python
from GCForest import *

gcf = gcForest(**kwargs)
gcf.fit(X_train, y_train)
gcf.predict(X_test)
```

  Notes
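The cascade idea behind this API can be mimicked with plain scikit-learn parts. Below is a minimal sketch, not the author's `gcForest` class (the `MiniCascade` name and its parameters are invented for illustration): each layer's random forest appends its class-probability vector to the original features before the next layer trains.

```python
# Minimal sketch of a cascade of random forests, in the spirit of the
# gcForest cascade module. Illustrative only; NOT the package's code.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

class MiniCascade:
    """Cascade of random forest layers (hypothetical toy class)."""
    def __init__(self, n_layers=2, n_estimators=50, random_state=0):
        self.n_layers = n_layers
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.layers = []

    def fit(self, X, y):
        feats = X
        for i in range(self.n_layers):
            rf = RandomForestClassifier(n_estimators=self.n_estimators,
                                        random_state=self.random_state + i)
            rf.fit(feats, y)
            self.layers.append(rf)
            # augment the raw features with this layer's class probabilities
            feats = np.hstack([X, rf.predict_proba(feats)])
        return self

    def predict(self, X):
        feats = X
        for rf in self.layers[:-1]:
            feats = np.hstack([X, rf.predict_proba(feats)])
        return self.layers[-1].predict(feats)

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = MiniCascade().fit(X_tr, y_tr)
acc = (clf.predict(X_te) == y_te).mean()
```

The real implementation additionally uses several forests per layer and a tolerance-based stopping rule; this toy version fixes the depth in advance.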

  I wrote the code from scratch in two days, and even though I have tested it on several cases I obviously cannot certify that it is 100% bug-free. Feel free to test it and send me feedback about any improvements and/or modifications!

  Known Issues

  Memory consumption when slicing data

  There is now a short naive calculation illustrating the issue in the notebook. So far the input data slicing is done in a single step to train the Random Forest for the Multi-Grain Scanning. The problem is that this can require a lot of memory, depending on the size of the data set and the number of slices requested, resulting in memory crashes (at least on my Intel Core 2 Duo).

  I have recently improved the memory usage (from version 0.1.4) when slicing the data but will keep looking at ways to optimize the code.
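The scale of the problem can be illustrated with a short naive calculation in the spirit of the one in the notebook (the function below and the example sizes are hypothetical, not taken from the repository):

```python
# Back-of-the-envelope estimate of the memory needed to hold every
# slice of every sample at once during Multi-Grain Scanning.
def slicing_memory_bytes(n_samples, l, L, wl, wL, sl=1, sL=1, itemsize=8):
    """Bytes for all slices, assuming float64 values (itemsize=8)."""
    slices_per_sample = ((l - wl) // sl + 1) * ((L - wL) // sL + 1)
    return n_samples * slices_per_sample * wl * wL * itemsize

# e.g. 60000 samples of 28x28 images, 14x14 window, stride 1
est = slicing_memory_bytes(60000, 28, 28, 14, 14)
print(est / 1e9, "GB")  # roughly 21 GB
```

Even a modest data set can thus exceed the RAM of a typical laptop, which is why slicing in a single step crashes on small machines.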

  OOB score error

  During Random Forest training, the Out-Of-Bag (OOB) technique is used for the prediction probabilities. This technique can sometimes raise an error when one or several samples end up being used in the training of every tree, leaving no OOB estimate for them.

  A potential solution is to use cross-validation instead of the OOB score, although this slows down training. In practice, simply increasing the number of trees and re-running the training (and crossing fingers) is often enough.
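The cross-validation workaround can be sketched with scikit-learn's `cross_val_predict`, which yields a class-probability estimate for every sample from a model that never saw it during fitting, so no sample is left without an estimate (illustrative only, not the package's actual code):

```python
# Cross-validated class probabilities as a drop-in replacement for the
# forest's OOB estimates. Every sample is predicted by a fold that
# excluded it, so the "sample in every tree" failure mode cannot occur.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=50, random_state=0)
proba = cross_val_predict(rf, X, y, cv=3, method="predict_proba")
```

The cost is refitting the forest once per fold, which is the slowdown mentioned above.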

  Built With

PyCharm community edition

memory_profiler library

  License

  This project is licensed under the MIT License (see LICENSE for details)

  Early Results

  (will be updated as new results come out)

Scikit-learn handwritten digits classification:

  training time ~ 5min

  accuracy ~ 98%
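For a rough sense of that figure, a plain random forest baseline on the same scikit-learn digits set can be run as follows (the split and parameters below are arbitrary choices, not the author's benchmark setup):

```python
# Single random forest baseline on the scikit-learn digits data set,
# for loose comparison with the ~98% gcForest accuracy quoted above.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_tr, y_tr)
baseline_acc = rf.score(X_te, y_te)
```

A single forest already lands in the mid-to-high nineties on this data set, so the digits task mainly checks that the cascade does not hurt accuracy rather than demonstrating a large gain.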

  Excerpt of the code:

```python
import itertools
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

__author__ = "Pierre-Yves Lablanche"
__email__ = "[email protected]"
__license__ = "MIT"
__version__ = "0.1.3"
__status__ = "Development"


# noinspection PyUnboundLocalVariable
class gcForest(object):

    def __init__(self, shape_1X=None, n_mgsRFtree=30, window=None, stride=1,
                 cascade_test_size=0.2, n_cascadeRF=2, n_cascadeRFtree=101,
                 cascade_layer=np.inf, min_samples_mgs=0.1,
                 min_samples_cascade=0.05, tolerance=0.0, n_jobs=1):
        """ gcForest Classifier.
```

  About scale

  The main technical issue in the current gcForest implementation is the memory usage when feeding in the input data. A quick calculation gives an idea of the number and size of the objects the algorithm will have to handle.

  For a problem with C classes and N samples of size [l, L], the initial data size is:

  N × l × L

  Slicing Step

  If the window is of size [wl, wL] and the chosen strides are [sl, sL], then the number of slices per sample is:

  n_slices = ((l − wl) / sl + 1) × ((L − wL) / sL + 1)

  Each slice is of size [wl, wL], hence the total size of the sliced data set is:

  N × n_slices × wl × wL

  This is when memory consumption reaches its peak.
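Plugging concrete (made-up) numbers into the slicing arithmetic above makes the blow-up visible:

```python
# Worked example of the slice-count and sliced-data-set sizes for
# Multi-Grain Scanning. All numbers are illustrative, not from the paper.
l, L = 10, 10    # sample dimensions
wl, wL = 4, 4    # window size
sl, sL = 2, 2    # strides
N = 1000         # number of samples

slices_per_sample = ((l - wl) // sl + 1) * ((L - wL) // sL + 1)
total_values = N * slices_per_sample * wl * wL   # values in the sliced set
original_values = N * l * L                      # values in the raw set
```

Here each 10×10 sample yields 16 slices of 4×4 values, so the sliced data set holds 2.56× as many values as the raw one; with stride 1 the ratio grows much faster.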

  Class Vector after Multi-Grain Scanning
