November 2007
Apache UIMA
Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. UIMA is a framework and SDK for developing such applications. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at. UIMA enables such an application to be decomposed into components, for example "language identification" -> "language specific segmentation" -> "sentence boundary detection" -> "entity detection (person/place names etc.)". Each component must implement interfaces defined by the framework and must provide self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages. UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.
Apache UIMA is an Apache-licensed open source implementation of the UIMA specification (that specification is, in turn, being developed concurrently by a technical committee within OASIS, a standards organization). We invite and encourage you to participate in both the implementation and specification efforts.
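To make the component model concrete, here is a minimal sketch of a UIMA annotator in Java (the class name and the naive regex heuristic are my own illustration; a real component would define its own annotation type in a type-system XML descriptor rather than reuse the generic Annotation type):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.tcas.Annotation;

    // Hypothetical pipeline component: marks capitalized two-word
    // sequences as candidate person names.
    public class PersonNameAnnotator extends JCasAnnotator_ImplBase {

        private static final Pattern CANDIDATE =
                Pattern.compile("\\b[A-Z][a-z]+ [A-Z][a-z]+\\b");

        @Override
        public void process(JCas jcas) throws AnalysisEngineProcessException {
            Matcher m = CANDIDATE.matcher(jcas.getDocumentText());
            while (m.find()) {
                // Record the span and add it to the CAS indexes so that
                // downstream components in the pipeline can see it.
                Annotation a = new Annotation(jcas, m.start(), m.end());
                a.addToIndexes();
            }
        }
    }

The framework instantiates components like this from their XML descriptors and routes the CAS (the shared analysis structure) through the pipeline from one component to the next.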
Getting started with OpenNLP (Natural Language Processing)
I found a great set of tools for natural language processing. The Java package includes a sentence detector, a tokenizer, a parts-of-speech (POS) tagger, and a treebank parser. It took me a little while to figure out where to start so I thought I'd post my findings here. I'm no linguist and I don't have previous experience with NLP, but hopefully this will help someone get set up with OpenNLP.
The OpenNLP Homepage
OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects. The site maintains the current list of OpenNLP projects, along with a fairly up-to-date list of useful links related to NLP software in general.
OpenNLP also hosts a variety of Java-based NLP tools which perform sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference resolution using the OpenNLP Maxent machine learning package. To start using these tools, download the latest release and check out the OpenNLP Tools API. For the latest news about these tools and to participate in discussions, see OpenNLP's SourceForge project page.
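As a taste of the tools, here is a hedged sketch chaining sentence detection, tokenization, and POS tagging using the model-loading API of later OpenNLP releases (the model file names are the stock English models; paths and the demo sentence are assumptions):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class OpenNlpDemo {
        public static void main(String[] args) throws Exception {
            String text = "OpenNLP hosts a variety of Java-based NLP tools. They are easy to try.";

            // Load the pre-trained English models shipped with the release
            // (file names are assumptions; adjust to your download).
            try (InputStream sentIn = new FileInputStream("en-sent.bin");
                 InputStream tokIn = new FileInputStream("en-token.bin");
                 InputStream posIn = new FileInputStream("en-pos-maxent.bin")) {

                // 1. Sentence detection
                SentenceDetectorME sentenceDetector =
                        new SentenceDetectorME(new SentenceModel(sentIn));
                String[] sentences = sentenceDetector.sentDetect(text);

                // 2. Tokenization and 3. POS tagging, sentence by sentence
                TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
                POSTaggerME tagger = new POSTaggerME(new POSModel(posIn));
                for (String sentence : sentences) {
                    String[] tokens = tokenizer.tokenize(sentence);
                    String[] tags = tagger.tag(tokens);
                    for (int i = 0; i < tokens.length; i++) {
                        System.out.println(tokens[i] + "/" + tags[i]);
                    }
                }
            }
        }
    }

Each of the statistical tools wraps a model trained with the OpenNLP Maxent package, which is why the pre-trained model files are downloaded separately from the code.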
The Stanford NLP (Natural Language Processing) Group
The Stanford NLP Group makes a number of pieces of NLP software available to the public. All these software distributions are licensed under the GNU General Public License for non-commercial and research use. (Note that this is the full GPL, which allows its use for research purposes or other free software projects but does not allow its incorporation into any type of commercial software, even in part or in translation. Please contact us if you are interested in NLP software with commercial licenses.)
All the software we distribute is written in Java. Recent distributions require Sun JDK 1.5 (some of the older ones run on JDK 1.4). Distribution packages include components for command-line invocation, jar files, a Java API, and source code.
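As an example of what using the Java API looks like, a small sketch with the Stanford POS tagger's MaxentTagger from a later distribution (the model path is an assumption; point it at the model file that ships with your download):

    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class StanfordTaggerDemo {
        public static void main(String[] args) {
            // The model path is an assumption; use the .tagger model
            // file included in your distribution.
            MaxentTagger tagger =
                    new MaxentTagger("models/english-left3words-distsim.tagger");
            // tagString returns the input with a tag appended to each
            // token, e.g. "The_DT quick_JJ brown_JJ fox_NN ..."
            System.out.println(tagger.tagString(
                    "The quick brown fox jumps over the lazy dog."));
        }
    }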
Category Theory for the Java Programmer « reperiendi
There are several good introductions to category theory, each written for a different audience. However, I have never seen one aimed at someone trained as a programmer rather than as a computer scientist or as a mathematician. There are programming languages that have been designed with category theory in mind, such as Haskell, OCaml, and others; however, they are not typically taught in undergraduate programming courses. Java, on the other hand, is often used as an introductory language; while it was not designed with category theory in mind, a lot of category theory carries over directly.
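To illustrate the kind of translation the post has in mind, here is my own small sketch (the Arrow interface and ListFunctor class are invented for illustration): the List type constructor acts as a functor, mapping types to types and functions to functions.

    import java.util.ArrayList;
    import java.util.List;

    // An arrow (morphism) between two Java types.
    interface Arrow<A, B> {
        B apply(A a);
    }

    // List behaves like a functor: it sends each type A to List<A> and
    // each arrow f : A -> B to map(f) : List<A> -> List<B>.
    class ListFunctor {
        static <A, B> List<B> map(Arrow<A, B> f, List<A> xs) {
            List<B> ys = new ArrayList<B>();
            for (A x : xs) {
                ys.add(f.apply(x));
            }
            return ys;
        }
    }

The functor laws, map(id) = id and map(g . f) = map(g) . map(f), hold for this definition, which is exactly the structure-preservation that makes the categorical reading meaningful.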
October 2007
Writing An Hadoop MapReduce Program In Python - Michael G. Noll
Even though the Hadoop framework is written in Java, programs for Hadoop need not be coded in Java but can also be developed in other languages like Python or C++ (the latter since version 0.14.1). However, the documentation and the most prominent Python example on the Hadoop home page could make you think that you must translate your Python code into a Java jar file using Jython. Obviously, this is not very convenient and can even be problematic if you depend on Python features not provided by Jython. Another issue with the Jython approach is the overhead of writing your Python program in such a way that it can interact with Hadoop - just have a look at the example in ${HADOOP_INSTALL}/src/examples/python/WordCount.py and you'll see what I mean. I still recommend having at least a look at the Jython approach and maybe even at the new C++ MapReduce API called Pipes; it's really interesting.
That said, the ground is prepared for the purpose of this tutorial: writing a Hadoop MapReduce program in a more Pythonic way, i.e. in a way you should be familiar with.
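Since this entry is specifically about Python, here is a minimal word-count mapper and reducer for Hadoop Streaming in the spirit of the tutorial (a hedged sketch; the file names are my own, and Streaming simply runs any executable that reads stdin and writes tab-separated key/value pairs to stdout):

    #!/usr/bin/env python
    # mapper.py - reads input lines from stdin, emits "word<TAB>1" pairs.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%s" % (word, 1))

    #!/usr/bin/env python
    # reducer.py - Streaming sorts mapper output by key before the reduce
    # step, so all counts for a given word arrive consecutively.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.strip().split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

Both scripts are then handed to the streaming jar, roughly: hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input ... -output ... (the jar's location varies by release).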
February 2007
Understanding JTA—the Java Transaction API
The Java™ Transaction API (JTA) allows applications to perform distributed transactions, that is, transactions that access and update data on two or more networked computer resources. The JTA specifies standard Java interfaces between a transaction manager and the parties involved in a distributed transaction system: the application, the application server, and the resource manager that controls access to the shared resources affected by the transactions. This document provides an overview of that process and how the DataDirect Connect® for JDBC™ drivers relate to it.
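To make the application's side of this concrete, a hedged sketch of transaction demarcation through JTA's UserTransaction interface (the class name and the elided resource updates are placeholders; java:comp/UserTransaction is the standard JNDI name inside a Java EE container):

    import javax.naming.InitialContext;
    import javax.transaction.UserTransaction;

    // Hypothetical component running inside an application server.
    // Everything touched between begin() and commit() on XA-capable
    // resources (e.g. two JDBC DataSources) joins one distributed
    // transaction, which the transaction manager coordinates with
    // two-phase commit.
    public class TransferBean {
        public void transfer() throws Exception {
            UserTransaction utx = (UserTransaction)
                    new InitialContext().lookup("java:comp/UserTransaction");
            utx.begin();
            try {
                // ... update data on two or more enlisted resources here ...
                utx.commit();
            } catch (Exception e) {
                utx.rollback();
                throw e;
            }
        }
    }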