What file formats should be supported for data and models? #30

Open
ryanbressler opened this issue Feb 4, 2014 · 18 comments

Comments

@ryanbressler
Owner

No description provided.

@ryanbressler
Owner Author

Libsvm file format has been requested here:

#31

@ryanbressler
Owner Author

ARFF, and possibly unlabeled CSV, as commonly used by machine learning repos.

@ryanbressler
Owner Author

Basic ARFF support is in. CSV is also supported now, but only if you use CloudForest as a library, since you need to define the feature types.

Wondering whether sparse ARFF and libsvm should be included, and whether a sparse feature representation is needed to do them well.

@ryanbressler
Owner Author

Basic libsvm support is in.

@tungntdhtl

How can I grow a CloudRF forest from a libsvm file? I don't know which target to declare.
For example:
~/cloudRF/growforest -train usps.libsvm -rfpred usps.sf -target ??? -nTrees 1000
where usps.libsvm is the training data file.

@ryanbressler
Owner Author

-target 0 should do it, since the target is in the first column and there aren't column names.

@tungntdhtl

I received the errors below:
~/cloudRF/growforest -train usps -rfpred usps.sf -target 0 -nTrees 500
Threads : 1
nTrees : 500
Loading data from: usps
panic: runtime error: index out of range

goroutine 1 [running]:
runtime.panic(0x8186020, 0x836d037)
/usr/local/go/src/pkg/runtime/panic.c:266 +0xac
github.com/ryanbressler/CloudForest.ParseAFM(0xb772bab8, 0x18600468, 0x836fd50)
/home/gm/golang/gopath/src/github.com/ryanbressler/CloudForest/featurematrix.go:294 +0xccd
github.com/ryanbressler/CloudForest.LoadAFM(0xbff7f407, 0x4, 0x0, 0x0, 0x0)
/home/gm/golang/gopath/src/github.com/ryanbressler/CloudForest/featurematrix.go:367 +0x2d4
main.main()
/home/gm/golang/gopath/src/github.com/ryanbressler/CloudForest/growforest/growforest.go:168 +0x1009

@ryanbressler
Owner Author

You need to rename usps to usps.libsvm so that growforest knows how to parse it.
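
To illustrate why the rename matters: the stack trace above shows the extensionless file going through ParseAFM, which suggests the parser is chosen from the file name. Here is a minimal sketch of that kind of dispatch. It is not the actual growforest code; `formatFromName` and the strings it returns are hypothetical, and both the `.arff` case and the case-insensitive matching are assumptions.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// formatFromName is a hypothetical helper, not part of CloudForest.
// It guesses a data format from the file extension and falls back to
// "afm" when the extension is missing or unknown.
func formatFromName(name string) string {
	switch strings.ToLower(filepath.Ext(name)) {
	case ".libsvm":
		return "libsvm"
	case ".arff": // assumption: ARFF is also detected by extension
		return "arff"
	default:
		return "afm"
	}
}

func main() {
	for _, name := range []string{"usps", "usps.libsvm"} {
		fmt.Printf("%s would be parsed as %s\n", name, formatFromName(name))
	}
}
```

Under these assumptions a file named usps falls through to the default parser, while usps.libsvm is routed to the libsvm one.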

@ryanbressler
Owner Author

Also do an update if you haven't, as I recently fixed some small bugs in the libsvm support.

@tungntdhtl

Great! It is running.
You should write some comments about this for CloudRF's users :)
Thanks Ryan!

@tungntdhtl

ryanbressler commented "-target 0 should do it since the target is in the first column and there aren't column names"
How does CloudRF recognize the data type of the target response? (B:, N:, or C:)

@ryanbressler
Owner Author

It checks whether the first entry is an int or a float. Ints are handled
as C or B, floats as N. If you want regression and the first entry is an
int, just make sure it is written with a decimal point (i.e. 0.0, not 0).
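
A rough sketch of that check, for illustration only. It is not the code CloudForest uses; `inferTargetType` is a hypothetical helper, and it lumps C and B together as "C" because how the two are told apart isn't covered here.

```go
package main

import (
	"fmt"
	"strconv"
)

// inferTargetType is a hypothetical helper, not CloudForest's own code.
// It returns "C" when the token parses as an integer and "N" when it
// only parses as a float.
func inferTargetType(token string) (string, error) {
	if _, err := strconv.Atoi(token); err == nil {
		return "C", nil
	}
	if _, err := strconv.ParseFloat(token, 64); err == nil {
		return "N", nil
	}
	return "", fmt.Errorf("target %q is neither an int nor a float", token)
}

func main() {
	for _, tok := range []string{"0", "0.0", "3", "2.5"} {
		kind, err := inferTargetType(tok)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s -> %s\n", tok, kind)
	}
}
```

With this rule "0" is treated as a class label while "0.0" is treated as a number, which is why writing the decimal point switches the forest to regression.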

@tungntdhtl

OK, thanks! That is a good way.
It can also read sparse libsvm format files, right?
i.e. the entries are represented as col:value pairs,
e.g. data with 100 features: 3 1:1 5:2.5 16:8 19:0.4 50:-1.2 55:1 72:4 85:6 90:3.2 98:3.8 100:6.2

@ryanbressler
Owner Author

Yes, all unspecified features will be assumed to be zero.
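
As an illustration of the zero-fill behaviour, here is a minimal sketch that expands one sparse libsvm line into a dense vector. It is not CloudForest's parser; `parseLibSVMLine` is a hypothetical helper that assumes 1-based feature indices, a known feature count, and no space after the colon in each index:value token.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLibSVMLine is a hypothetical helper, not CloudForest's parser.
// It returns the label token and a dense vector of length nFeatures in
// which every feature not listed on the line is left at zero.
func parseLibSVMLine(line string, nFeatures int) (string, []float64, error) {
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return "", nil, fmt.Errorf("empty line")
	}
	dense := make([]float64, nFeatures) // Go zero-initialises the slice
	for _, f := range fields[1:] {
		parts := strings.SplitN(f, ":", 2)
		if len(parts) != 2 {
			return "", nil, fmt.Errorf("malformed entry %q", f)
		}
		idx, err := strconv.Atoi(parts[0])
		if err != nil || idx < 1 || idx > nFeatures {
			return "", nil, fmt.Errorf("bad feature index in %q", f)
		}
		val, err := strconv.ParseFloat(parts[1], 64)
		if err != nil {
			return "", nil, fmt.Errorf("bad feature value in %q", f)
		}
		dense[idx-1] = val // assumption: indices in the file are 1-based
	}
	return fields[0], dense, nil
}

func main() {
	label, features, err := parseLibSVMLine("3 1:1 5:2.5", 6)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(label, features)
}
```

A dense vector like this costs memory proportional to the full feature count per record, which is the trade-off behind the earlier question about a sparse feature representation.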

@tungntdhtl

With a LIBSVM file containing a lot of records (e.g. 60,000,000), how can I build trees in CloudRF?

I tried setting a portion of the total records using the "nSamples=0.1" option. Does that mean CloudRF works with only 10% of the total samples?
If yes, how can I take bootstrap samples as a portion of the total records? i.e. each tree grows from 10% of the total records, with each 10% being a random sample of the total records.

@ryanbressler
Owner Author

Random forest bags samples independently for each tree, so I think it is
already doing what you are asking for.
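
To make "independently for each tree" concrete, here is a small sketch in which every tree draws its own random bag of case indices. It is not CloudForest's sampling code; `bagIndices`, `bagsForTrees`, and the fraction parameter are hypothetical, and drawing with replacement is an assumption.

```go
package main

import (
	"fmt"
	"math/rand"
)

// bagIndices draws int(frac*nCases) case indices with replacement.
// Hypothetical helper; nCases must be greater than zero.
func bagIndices(rng *rand.Rand, nCases int, frac float64) []int {
	n := int(frac * float64(nCases))
	idx := make([]int, n)
	for i := range idx {
		idx[i] = rng.Intn(nCases)
	}
	return idx
}

// bagsForTrees draws a separate bag for each tree from one seeded
// generator, so each tree sees a different random subset of the cases.
func bagsForTrees(seed int64, nTrees, nCases int, frac float64) [][]int {
	rng := rand.New(rand.NewSource(seed))
	bags := make([][]int, nTrees)
	for t := range bags {
		bags[t] = bagIndices(rng, nCases, frac)
	}
	return bags
}

func main() {
	for t, bag := range bagsForTrees(42, 3, 20, 0.1) {
		fmt.Printf("tree %d trains on cases %v\n", t, bag)
	}
}
```

In this sketch each tree only ever sees a small bag, but across many trees most of the data set ends up being drawn by at least one of them.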

@tungntdhtl

I mean that RF struggles to build trees from large sample sizes because the trees become large.
In CloudRF, RF can grow from a portion of the total records.
My question is: what is the scope of that portion? Does each tree use all of the bagged records, or only a small independent sample (e.g. 10%)?
