gcForest (multi-Grained Cascade Forest): Professor Zhihua Zhou's algorithm applied to predicting stock index futures (Part 2)

Date: 2017-04-29 05:17
gcForest is an algorithm proposed by Zhou and Feng (2017). It uses a multi-grain scanning approach for data slicing and a cascade structure of multiple random forest layers (see the paper for details).

  gcForest was first developed as a classifier and is designed so that the multi-grain scanning module and the cascade structure can be used separately. During development I paid special attention to writing the code in such a way that future parallelization should be straightforward to implement.

  Prerequisites

  The present code has been developed under Python 3.x. You will need the following installed on your computer to make it work:

Python 3.x

Numpy >= 1.12.0

Scikit-learn >= 0.18.1

jupyter >= 1.0.0 (only needed to run the tutorial notebook)

  You can install all of them using pip:

```sh
$ pip3 install -r requirements.txt
```

  Using gcForest

  The syntax follows the scikit-learn style, with a .fit() function to train the algorithm and a .predict() function to predict the class of new samples. You can find two examples in the jupyter notebook included in the repository.

```python
from GCForest import *

gcf = gcForest(**kwargs)
gcf.fit(X_train, y_train)
gcf.predict(X_test)
```

  Notes
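The cascade idea behind this API can be mimicked with plain scikit-learn parts. Below is a minimal sketch, not the author's `gcForest` class (the `MiniCascade` name and its parameters are invented for illustration): each layer's random forest appends its class-probability vector to the original features before the next layer trains.

```python
# Minimal sketch of a cascade of random forests, in the spirit of the
# gcForest cascade module. Illustrative only; NOT the package's code.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

class MiniCascade:
    """Cascade of random forest layers (hypothetical toy class)."""
    def __init__(self, n_layers=2, n_estimators=50, random_state=0):
        self.n_layers = n_layers
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.layers = []

    def fit(self, X, y):
        feats = X
        for i in range(self.n_layers):
            rf = RandomForestClassifier(n_estimators=self.n_estimators,
                                        random_state=self.random_state + i)
            rf.fit(feats, y)
            self.layers.append(rf)
            # augment the raw features with this layer's class probabilities
            feats = np.hstack([X, rf.predict_proba(feats)])
        return self

    def predict(self, X):
        feats = X
        for rf in self.layers[:-1]:
            feats = np.hstack([X, rf.predict_proba(feats)])
        return self.layers[-1].predict(feats)

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = MiniCascade().fit(X_tr, y_tr)
acc = (clf.predict(X_te) == y_te).mean()
```

The real implementation additionally uses several forests per layer and a tolerance-based stopping rule; this toy version fixes the depth in advance.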

  I wrote the code from scratch in two days, and even though I have tested it on several cases I obviously cannot certify that it is 100% bug-free. Feel free to test it and send me feedback about any improvements and/or modifications!

  Known Issues

  Memory consumption when slicing data

  There is now a short naive calculation illustrating the issue in the notebook. So far the input data slicing is done in a single step to train the Random Forest for the Multi-Grain Scanning. The problem is that this can require a lot of memory, depending on the size of the data set and the number of slices requested, resulting in memory crashes (at least on my Intel Core 2 Duo).

  I have recently improved the memory usage (from version 0.1.4) when slicing the data but will keep looking at ways to optimize the code.
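The scale of the problem can be illustrated with a short naive calculation in the spirit of the one in the notebook (the function below and the example sizes are hypothetical, not taken from the repository):

```python
# Back-of-the-envelope estimate of the memory needed to hold every
# slice of every sample at once during Multi-Grain Scanning.
def slicing_memory_bytes(n_samples, l, L, wl, wL, sl=1, sL=1, itemsize=8):
    """Bytes for all slices, assuming float64 values (itemsize=8)."""
    slices_per_sample = ((l - wl) // sl + 1) * ((L - wL) // sL + 1)
    return n_samples * slices_per_sample * wl * wL * itemsize

# e.g. 60000 samples of 28x28 images, 14x14 window, stride 1
est = slicing_memory_bytes(60000, 28, 28, 14, 14)
print(est / 1e9, "GB")  # roughly 21 GB
```

Even a modest data set can thus exceed the RAM of a typical laptop, which is why slicing in a single step crashes on small machines.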

  OOB score error

  During Random Forest training, the Out-Of-Bag (OOB) technique is used for the prediction probabilities. This technique can sometimes raise an error when one or several samples end up being used in the training of every tree, leaving no OOB estimate for them.

  A potential solution is to use cross-validation instead of the OOB score, although this slows down training. In practice, simply increasing the number of trees and re-running the training (and crossing fingers) is often enough.
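The cross-validation workaround can be sketched with scikit-learn's `cross_val_predict`, which yields a class-probability estimate for every sample from a model that never saw it during fitting, so no sample is left without an estimate (illustrative only, not the package's actual code):

```python
# Cross-validated class probabilities as a drop-in replacement for the
# forest's OOB estimates. Every sample is predicted by a fold that
# excluded it, so the "sample in every tree" failure mode cannot occur.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=50, random_state=0)
proba = cross_val_predict(rf, X, y, cv=3, method="predict_proba")
```

The cost is refitting the forest once per fold, which is the slowdown mentioned above.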

  Built With

PyCharm community edition

memory_profiler library

  License

  This project is licensed under the MIT License (see LICENSE for details)

  Early Results

  (will be updated as new results come out)

Scikit-learn handwritten digits classification:

  training time ~ 5min

  accuracy ~ 98%
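For a rough sense of that figure, a plain random forest baseline on the same scikit-learn digits set can be run as follows (the split and parameters below are arbitrary choices, not the author's benchmark setup):

```python
# Single random forest baseline on the scikit-learn digits data set,
# for loose comparison with the ~98% gcForest accuracy quoted above.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_tr, y_tr)
baseline_acc = rf.score(X_te, y_te)
```

A single forest already lands in the mid-to-high nineties on this data set, so the digits task mainly checks that the cascade does not hurt accuracy rather than demonstrating a large gain.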

  Excerpt of the code:

```python
import itertools
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

__author__ = "Pierre-Yves Lablanche"
__email__ = "[email protected]"
__license__ = "MIT"
__version__ = "0.1.3"
__status__ = "Development"


# noinspection PyUnboundLocalVariable
class gcForest(object):

    def __init__(self, shape_1X=None, n_mgsRFtree=30, window=None, stride=1,
                 cascade_test_size=0.2, n_cascadeRF=2, n_cascadeRFtree=101,
                 cascade_layer=np.inf, min_samples_mgs=0.1,
                 min_samples_cascade=0.05, tolerance=0.0, n_jobs=1):
        """ gcForest Classifier.
```

  About scale

  The main technical issue in the current gcForest implementation is the memory usage when feeding in the input data. A quick calculation gives an idea of the number and size of the objects the algorithm will have to handle.

  For a problem with C classes and N samples of size [l, L], the initial data size is:

  N × l × L

  Slicing Step

  If the window is of size [wl, wL] and the chosen strides are [sl, sL], then the number of slices per sample is:

  n_slices = ((l − wl) / sl + 1) × ((L − wL) / sL + 1)

  Each slice is of size [wl, wL], hence the total size of the sliced data set is:

  N × n_slices × wl × wL

  This is when memory consumption reaches its peak.
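Plugging concrete (made-up) numbers into the slicing arithmetic above makes the blow-up visible:

```python
# Worked example of the slice-count and sliced-data-set sizes for
# Multi-Grain Scanning. All numbers are illustrative, not from the paper.
l, L = 10, 10    # sample dimensions
wl, wL = 4, 4    # window size
sl, sL = 2, 2    # strides
N = 1000         # number of samples

slices_per_sample = ((l - wl) // sl + 1) * ((L - wL) // sL + 1)
total_values = N * slices_per_sample * wl * wL   # values in the sliced set
original_values = N * l * L                      # values in the raw set
```

Here each 10×10 sample yields 16 slices of 4×4 values, so the sliced data set holds 2.56× as many values as the raw one; with stride 1 the ratio grows much faster.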

  Class Vector after Multi-Grain Scanning
