The MC program:
1. Recursively descends directories, finding text files.
2. Processes files selectively through full regular expression matching of file names.
3. Builds a sparse matrix of word/token counts. The particular sparse matrix format used is given here.
4. Processes any user-specified text formats (e-mail addresses or URLs) as a whole token through regular expression matching or a FLEX definition.
5. Prunes the vocabulary by word length and frequency.
6. Excludes user-specified stop words.
7. Sets word vector weights according to any of the txx, txn, tfn, tfx, lxx, lxn, lfn, lfx scaling schemes.
8. Writes all data structures to disk in the Compressed Column Storage format.
The application does not provide English parsing, part-of-speech tagging, or complete documentation.
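To make the Compressed Column Storage idea concrete, here is a minimal sketch of how a word-count matrix could be laid out in that format. It is not MC's actual code: the names `CcsMatrix` and `build_ccs` are hypothetical, tokens are split on whitespace only, and the sketch assumes one column per document and one row per word.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Compressed Column Storage (CCS) for a word-by-document count matrix.
// Assumption: columns are documents and rows are words.
struct CcsMatrix {
    std::vector<double> values;        // nonzero counts, stored column by column
    std::vector<std::size_t> row_idx;  // row (word) index of each stored value
    std::vector<std::size_t> col_ptr;  // col_ptr[j] = offset of column j in values
    std::vector<std::string> vocab;    // row index -> word
};

// Hypothetical helper: builds the matrix from in-memory document strings.
CcsMatrix build_ccs(const std::vector<std::string>& docs) {
    // First pass: collect the vocabulary; std::map keeps it sorted.
    std::map<std::string, std::size_t> word_id;
    for (const auto& doc : docs) {
        std::istringstream in(doc);
        std::string word;
        while (in >> word) word_id.emplace(word, 0);
    }

    CcsMatrix m;
    for (auto& entry : word_id) {
        entry.second = m.vocab.size();  // assign row indices in sorted order
        m.vocab.push_back(entry.first);
    }

    // Second pass: count words per document and append one column each.
    m.col_ptr.push_back(0);
    for (const auto& doc : docs) {
        std::map<std::size_t, double> counts;  // ordered by row index
        std::istringstream in(doc);
        std::string word;
        while (in >> word) counts[word_id.at(word)] += 1.0;
        for (const auto& rc : counts) {
            m.row_idx.push_back(rc.first);
            m.values.push_back(rc.second);
        }
        m.col_ptr.push_back(m.values.size());
    }

    // CCS invariant: one column pointer per document plus a final sentinel.
    assert(m.col_ptr.size() == docs.size() + 1);
    return m;
}
```

Calling `build_ccs({"a b a", "b c"})` stores only the four nonzero counts. The entries of document `j` occupy positions `col_ptr[j]` up to, but not including, `col_ptr[j + 1]` in `values` and `row_idx`, which is why the format suits sparse word-count data where most words are absent from most documents.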
Obtaining
User README available in HTML format from http://www.cs.utexas.edu/users/jfan/dm/README.html
Support contacts
Help List | <jfan@cs.utexas.edu> |
Developer List | <jfan@cs.utexas.edu> |
Bug List | <jfan@cs.utexas.edu> |
Interfaces | command line |
Source languages | C++ |
Build prerequisites | FLEX, STL, pthread library |
License verified by | Janet Casey <jcasey@gnu.org> on 2001-07-02 |
Entry compiled by | James Fan <jfan@cs.utexas.edu> |
The copyright licensing notice below applies to this text. The software described in this text has its own copyright notice and license, which can usually be found in the distribution itself.
Copyright © 2000, 2001, 2002, 2003 Free Software Foundation, Inc.
Permission is granted to copy, distribute, and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of this license is included in the file COPYING.DOC.