Package ca.cgjennings.algo
Class TextIndexer.DefaultTextMapper
- java.lang.Object
  - ca.cgjennings.algo.TextIndexer.DefaultTextMapper
- All Implemented Interfaces: TextIndexer.TextMapper
- Enclosing class: TextIndexer
public static class TextIndexer.DefaultTextMapper extends java.lang.Object implements TextIndexer.TextMapper
A default text mapper implementation that assumes that the source IDs represent URLs. The returned indexed IDs are identical to the source IDs.
Constructor Summary
- DefaultTextMapper()
-
Method Summary
- java.lang.String getIndexID(java.lang.String sourceID)
  Maps a source identifier to an index identifier.
- java.lang.String getText(java.lang.String sourceID)
  Given a source ID, return the text associated with that ID.
- protected java.lang.String preprocess(java.lang.String sourceID, java.net.URL url, java.lang.String text)
  Preprocesses the text after it is read but before it is returned to the caller of getText(java.lang.String).
- protected java.lang.String read(java.lang.String sourceID, java.net.URL url, java.lang.String encodingHint)
  Reads the source document from the URL and returns it as a string of indexable words.
- protected java.net.URL toURL(java.lang.String sourceID)
  Return a URL for the source ID.
Method Detail
-
getIndexID
public java.lang.String getIndexID(java.lang.String sourceID)
Description copied from interface: TextIndexer.TextMapper
Maps a source identifier to an index identifier. If the source ID should be identified differently in the index, this returns the version to include in the index. In this default implementation, the returned index ID is identical to the source ID.
- Specified by:
  getIndexID in interface TextIndexer.TextMapper
- Parameters:
  sourceID - the ID used to locate the text during indexing
- Returns:
  the ID used to locate the text when using the index
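As a rough illustration of the identity mapping, and of how a custom mapper might store shorter IDs in the index, here is a self-contained sketch. It does not use the real TextIndexer classes: `IndexIdSketch`, `defaultIndexID`, `relativeIndexID`, and the example base URL are all hypothetical stand-ins.

```java
// Stand-in sketch; these names are hypothetical and not part of TextIndexer.
class IndexIdSketch {
    // Mirrors the documented default: the index ID is the source ID itself.
    static String defaultIndexID(String sourceID) {
        return sourceID;
    }

    // Hypothetical alternative: strip a known base prefix so that the index
    // stores shorter IDs that do not depend on where the documents live.
    static String relativeIndexID(String base, String sourceID) {
        return sourceID.startsWith(base) ? sourceID.substring(base.length()) : sourceID;
    }

    public static void main(String[] args) {
        String base = "https://example.com/docs/";
        String id = base + "guide/intro.html";
        System.out.println(defaultIndexID(id));        // https://example.com/docs/guide/intro.html
        System.out.println(relativeIndexID(base, id)); // guide/intro.html
    }
}
```

A mapper that shortens IDs this way would need the code that later consumes the index to know how to turn an index ID back into a full location.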
-
getText
public java.lang.String getText(java.lang.String sourceID) throws java.io.IOException
Given a source ID, return the text associated with that ID. The default mapper does this by calling toURL(java.lang.String) on the source ID, reading the document from the resulting URL, and then preprocessing the result.
- Specified by:
  getText in interface TextIndexer.TextMapper
- Parameters:
  sourceID - an identifier that the mapper uses to locate the text
- Returns:
  the text mapped to by the ID
- Throws:
  java.io.IOException - if an I/O error occurs while fetching the document
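The call sequence described above can be sketched with a small stand-in class. This is not the actual DefaultTextMapper source: `PipelineSketch` is a hypothetical name, and the UTF-8 decoding and the `null` encoding hint passed to `read` are assumptions. The example reads a temporary file through a `file:` URL so that it runs without network access.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in that mimics the documented call order only.
class PipelineSketch {
    String getText(String sourceID) throws IOException {
        URL url = toURL(sourceID);               // 1. source ID -> URL
        String raw = read(sourceID, url, null);  // 2. fetch the document (null hint is an assumption)
        return preprocess(sourceID, url, raw);   // 3. hook for clean-up before returning
    }

    protected URL toURL(String sourceID) throws IOException {
        return new URL(sourceID);
    }

    protected String read(String sourceID, URL url, String encodingHint) throws IOException {
        try (InputStream in = url.openStream()) {
            // Assumption: decode as UTF-8; the real class may choose differently.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    protected String preprocess(String sourceID, URL url, String text) {
        return text; // documented default: text is returned unchanged
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mapper", ".txt");
        try {
            Files.writeString(tmp, "hello index");
            System.out.println(new PipelineSketch().getText(tmp.toUri().toString())); // hello index
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because each step is a separate protected method, a subclass can change how IDs are resolved, how documents are decoded, or how text is cleaned up without reimplementing getText.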
-
toURL
protected java.net.URL toURL(java.lang.String sourceID) throws java.io.IOException
Return a URL for the source ID. The default implementation simply returns a new URL using the source ID as if by new URL(sourceID).
- Parameters:
  sourceID - the source ID to convert to a URL
- Returns:
  a URL to use to read the source text
- Throws:
  java.io.IOException - if an error occurs while creating the URL
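To make the default behaviour concrete, the sketch below shows what `new URL(sourceID)` does with a well-formed ID and with one that has no protocol, followed by one way a subclass might relax the rule. `ToUrlSketch`, `defaultToURL`, and `lenientToURL` are hypothetical names, and the file-path fallback is an assumption about what an override could do, not something this class provides. On JDK 20 and later the `URL(String)` constructor is deprecated, so this may compile with a warning.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Paths;

// Stand-in sketch; not part of TextIndexer.
class ToUrlSketch {
    // Mirrors the documented default. MalformedURLException is a subclass of
    // IOException, so it fits the declared throws clause.
    static URL defaultToURL(String sourceID) throws IOException {
        return new URL(sourceID);
    }

    // Hypothetical override: if the ID is not a valid URL, treat it as a
    // local file path instead of failing.
    static URL lenientToURL(String sourceID) throws IOException {
        try {
            return new URL(sourceID);
        } catch (MalformedURLException e) {
            return Paths.get(sourceID).toAbsolutePath().toUri().toURL();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(defaultToURL("https://example.com/a.html").getHost()); // example.com
        System.out.println(lenientToURL("notes/readme.txt").getProtocol());       // file
        try {
            defaultToURL("notes/readme.txt");
        } catch (MalformedURLException e) {
            System.out.println("rejected: no protocol in source ID");
        }
    }
}
```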
-
read
protected java.lang.String read(java.lang.String sourceID, java.net.URL url, java.lang.String encodingHint) throws java.io.IOException
Reads the source document from the URL and returns it as a string of indexable words.
- Parameters:
  sourceID - the identifier of the document
  url - the URL to read the document from
  encodingHint - the name of an encoding, or null to use a default encoding
- Returns:
  the document text
- Throws:
  java.io.IOException - if an error occurs while reading the document
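The role of the encoding hint can be illustrated with a minimal reader. This is only a sketch under stated assumptions: `ReadSketch` is a hypothetical name, UTF-8 stands in for the unspecified default encoding, and the sketch returns the decoded text as-is rather than reducing it to indexable words as the real method may do.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Stand-in sketch; not the real DefaultTextMapper.read implementation.
class ReadSketch {
    static String read(String sourceID, URL url, String encodingHint) throws IOException {
        // Assumption: fall back to UTF-8 when no encoding name is supplied.
        Charset charset = (encodingHint == null)
                ? StandardCharsets.UTF_8
                : Charset.forName(encodingHint);
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes(), charset);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("read", ".txt");
        try {
            // Write Latin-1 bytes so the hint actually matters when decoding.
            Files.write(tmp, "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1));
            URL url = tmp.toUri().toURL();
            String text = read(tmp.toString(), url, "ISO-8859-1");
            System.out.println(text.equals("caf\u00e9")); // true
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Reading the same bytes with a `null` hint would decode them as UTF-8 in this sketch and produce a replacement character in place of the accented letter, which is why passing the correct hint matters for non-ASCII documents.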
-
preprocess
protected java.lang.String preprocess(java.lang.String sourceID, java.net.URL url, java.lang.String text)
Preprocesses the text after it is read but before it is returned to the caller of getText(java.lang.String). The default implementation returns the text unchanged.
- Parameters:
  sourceID - the identifier of the document
  url - the URL that the document was read from
  text - the original text
- Returns:
  the modified text
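Since the default is a pass-through, this method is the natural hook for cleaning up text before it is indexed. The sketch below shows one possible transformation a subclass might apply. `PreprocessSketch` and `stripMarkup` are hypothetical names, the regular expression is a deliberately crude tag remover and not a real HTML parser, and the sourceID and url parameters are omitted for brevity.

```java
// Stand-in sketch; not part of TextIndexer.
class PreprocessSketch {
    // Mirrors the documented default: the text is returned unchanged.
    static String defaultPreprocess(String text) {
        return text;
    }

    // Hypothetical override body: replace anything that looks like a tag
    // with a space, then collapse runs of whitespace.
    static String stripMarkup(String text) {
        return text.replaceAll("<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1>\n<p>Some <b>bold</b> text.</p>";
        System.out.println(stripMarkup(html)); // Title Some bold text.
    }
}
```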