# HG changeset patch # User Gustavo Picon # Date 1282505439 18000 # Node ID 415efab93c3f60c7ffead90724bdb0ed613e87bb # Parent aeba0e85798665ab0e652b110b526a4fd22698c0 refactoring and added machine tag support diff -r aeba0e85798665ab0e652b110b526a4fd22698c0 -r 415efab93c3f60c7ffead90724bdb0ed613e87bb docs/api.rst --- a/docs/api.rst Sun Jun 06 14:18:04 2010 -0500 +++ b/docs/api.rst Sun Aug 22 14:30:39 2010 -0500 @@ -4,14 +4,14 @@ .. module:: tagtools .. moduleauthor:: Gustavo Picon -.. inheritance-diagram:: Serializer -.. autoclass:: Serializer +.. inheritance-diagram:: Tokenizer +.. autoclass:: Tokenizer :show-inheritance: Provides methods to subclass tagging serializers. - Must not be used directly, use a subclass (:class:`FlickrSerializer`, - :class:`DeliciousSerializer` or :class:`CommaSerializer`) instead. + Must not be used directly, use a subclass (:class:`FlickrTokenizer`, + :class:`DeliciousTokenizer` or :class:`CommaTokenizer`) instead. The subclasses are not designed to be instantiated, they contains only class and static methods. @@ -22,9 +22,9 @@ If more than one tag have the same normalized form, only the first tag will be included in the resulting list. So for instance, if - using the :class:`CommaSerializer` subclass:: + using the :class:`CommaTokenizer` subclass:: - CommaSerializer.str2tags("TaG, tag, TAG") + CommaTokenizer.str2tags("TaG, tag, TAG") would return:: @@ -42,11 +42,11 @@ .. note:: - By default, all Serializers will call `.lower()` on the + By default, all Tokenizers will call `.lower()` on the given `tag`. You can change this behavior either by further subclassing or composition, like:: - class MySerializer(CommaSerializer): + class MyTokenizer(CommaTokenizer): @staticmethod def normalize(tag): diff -r aeba0e85798665ab0e652b110b526a4fd22698c0 -r 415efab93c3f60c7ffead90724bdb0ed613e87bb docs/comma.rst --- a/docs/comma.rst Sun Jun 06 14:18:04 2010 -0500 +++ b/docs/comma.rst Sun Aug 22 14:30:39 2010 -0500 @@ -1,16 +1,16 @@ -CommaSerializer +CommaTokenizer =============== .. currentmodule:: tagtools .. moduleauthor:: Gustavo Picon -.. inheritance-diagram:: CommaSerializer -.. autoclass:: CommaSerializer +.. inheritance-diagram:: CommaTokenizer +.. autoclass:: CommaTokenizer :show-inheritance: Example:: - CommaSerializer.str2tags('Tag 1, Tag2, TAG 1, Tag3') + CommaTokenizer.str2tags('Tag 1, Tag2, TAG 1, Tag3') returns:: @@ -18,7 +18,7 @@ and:: - CommaSerializer.tags2str(['tag1', 'tag2', 'tag3']) + CommaTokenizer.tags2str(['tag1', 'tag2', 'tag3']) returns:: diff -r aeba0e85798665ab0e652b110b526a4fd22698c0 -r 415efab93c3f60c7ffead90724bdb0ed613e87bb docs/conf.py --- a/docs/conf.py Sun Jun 06 14:18:04 2010 -0500 +++ b/docs/conf.py Sun Aug 22 14:30:39 2010 -0500 @@ -46,9 +46,9 @@ # built documents. # # The short X.Y version. -version = '0.8c' +version = '0.8d' # The full version, including alpha/beta/rc tags. -release = '0.8c' +release = '0.8d' # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. diff -r aeba0e85798665ab0e652b110b526a4fd22698c0 -r 415efab93c3f60c7ffead90724bdb0ed613e87bb docs/delicious.rst --- a/docs/delicious.rst Sun Jun 06 14:18:04 2010 -0500 +++ b/docs/delicious.rst Sun Aug 22 14:30:39 2010 -0500 @@ -1,16 +1,16 @@ -DeliciousSerializer +DeliciousTokenizer =================== .. currentmodule:: tagtools .. moduleauthor:: Gustavo Picon -.. inheritance-diagram:: DeliciousSerializer -.. autoclass:: DeliciousSerializer +.. inheritance-diagram:: DeliciousTokenizer +.. autoclass:: DeliciousTokenizer :show-inheritance: Example:: - DeliciousSerializer.str2tags('Tag1 Tag2 TAG1 Tag3') + DeliciousTokenizer.str2tags('Tag1 Tag2 TAG1 Tag3') returns:: @@ -18,7 +18,7 @@ and:: - DeliciousSerializer.tags2str(['tag1', 'tag2', 'tag3']) + DeliciousTokenizer.tags2str(['tag1', 'tag2', 'tag3']) returns:: diff -r aeba0e85798665ab0e652b110b526a4fd22698c0 -r 415efab93c3f60c7ffead90724bdb0ed613e87bb docs/flickr.rst --- a/docs/flickr.rst Sun Jun 06 14:18:04 2010 -0500 +++ b/docs/flickr.rst Sun Aug 22 14:30:39 2010 -0500 @@ -1,16 +1,16 @@ -FlickrSerializer +FlickrTokenizer ================ .. currentmodule:: tagtools .. moduleauthor:: Gustavo Picon -.. inheritance-diagram:: FlickrSerializer -.. autoclass:: FlickrSerializer +.. inheritance-diagram:: FlickrTokenizer +.. autoclass:: FlickrTokenizer :show-inheritance: Example:: - FlickrSerializer.str2tags('"Tag 1" Tag2 "TAG 1" Tag3') + FlickrTokenizer.str2tags('"Tag 1" Tag2 "TAG 1" Tag3') returns:: @@ -18,7 +18,7 @@ and:: - FlickrSerializer.tags2str(['tag 1', 'tag2', 'tag3']) + FlickrTokenizer.tags2str(['tag 1', 'tag2', 'tag3']) returns:: diff -r aeba0e85798665ab0e652b110b526a4fd22698c0 -r 415efab93c3f60c7ffead90724bdb0ed613e87bb docs/index.rst --- a/docs/index.rst Sun Jun 06 14:18:04 2010 -0500 +++ b/docs/index.rst Sun Aug 22 14:30:39 2010 -0500 @@ -11,10 +11,11 @@ - **Flexible**: Includes 3 different tag implementations with the same API: - 1. Flickr (:class:`FlickrSerializer`) - 2. Delicious (:class:`DeliciousSerializer`) - 3. Comma separated tags (:class:`CommaSerializer`) + 1. Flickr (:class:`FlickrTokenizer`) + 2. Delicious (:class:`DeliciousTokenizer`) + 3. Comma separated tags (:class:`CommaTokenizer`) +- **Powerful**: Manages multi-dimensional tags (machine tags). - **Customizable**: Handles customizable per-tag normalization to avoid tag duplicates. - **Easy**: Simple :doc:`API ` diff -r aeba0e85798665ab0e652b110b526a4fd22698c0 -r 415efab93c3f60c7ffead90724bdb0ed613e87bb tagtools.py --- a/tagtools.py Sun Jun 06 14:18:04 2010 -0500 +++ b/tagtools.py Sun Aug 22 14:30:39 2010 -0500 @@ -1,9 +1,62 @@ +import re -__version__ = '0.8c' +__version__ = '0.8d' +RE_MACHINE_TAG = re.compile(r""" + ^ # begin + ([a-z][a-z0-9_]*) # namespace + \: # separator + ([a-z][a-z0-9_]*) # predicate + \= # separator + (.+) # value + $ # the end """, re.VERBOSE) -class Serializer(object): + +class Tag: + "Tag objects" + + def __init__(self, raw_tag): + self.raw = raw_tag.strip() + self.is_machinetag = False + self.namespace, self.predicate, self.value = None, None, None + self.parse() + + def parse(self): + self.clean = self.normalize(self.raw) + + if ':' in self.raw and '=' in self.raw: + mmatch = RE_MACHINE_TAG.match(self.raw) + if mmatch: + self.is_machinetag = True + self.namespace, self.predicate, value = mmatch.groups() + self.value = self.normalize(value) + + @staticmethod + def normalize(tag): + """ Normalizes a single tag. + + :param tag: A single tag, as a string. It is assumed that the tag has + no leading/trailing whitespace. + + :returns: A normalized version of the tag. + """ + return tag.lower() + + +class Tokenizer(object): SEPARATOR = JOINER = TAGS_WITH_SPACES = None + TAGCLASS = Tag + + @classmethod + def _process_tag(cls, tags, keys, strtag): + tag = cls.TAGCLASS(strtag) + cleantag = tag.clean + if cleantag and cleantag not in keys: + # Ignore if the normalized tag is empty or if there is + # already a tag with the same normalized value. + # TaG, TAG, tag, taG ==> TaG + tags.append(tag) + keys.add(cleantag) @classmethod def str2tags(cls, tagstr): @@ -11,43 +64,30 @@ :param tagstr: A string with tags as entered by a user on a form. - :returns: A list of tuples. Each tuple represents a tag and has - two elements: - - - The normalized tag. Normalization is done by the - :meth:`normalize` static method. - - The raw tag as was entered, but without leading/trailing - whitespace. + :returns: A list of Tag objects. If you subclass Tag, set your subclass + in the TAGCLASS property. """ if not tagstr: return [] tags, keys = [], set() - for tag in tagstr.split(cls.SEPARATOR): - tag = tag.strip() - cleantag = cls.normalize(tag) - if not cleantag or cleantag in keys: - # Ignore if the normalized tag is empty or if there is - # already a tag with the same normalized value. - # TaG, TAG, tag, taG ==> TaG - continue - tags.append((cleantag, tag)) - keys.add(cleantag) + for strtag in tagstr.split(cls.SEPARATOR): + cls._process_tag(tags, keys, strtag) return tags @classmethod def tags2str(cls, tags): """ Takes a list of tags and returns a string that can be edited. - :param tags: A list of tags that are correct for the Serializer being - used. For instance, when using :class:`CommaSerializer`, + :param tags: A list of tags that are correct for the Tokenizer being + used. For instance, when using :class:`CommaTokenizer`, tags can't have commas on them. :returns: A string that, if serialized, would return the same tags. :raise TagWithSeparatorException: - * if a tag has a space when using :class:`DeliciousSerializer`, or - * a tag has a comma when using :class:`CommaSerializer` + * if a tag has a space when using :class:`DeliciousTokenizer`, or + * a tag has a comma when using :class:`CommaTokenizer` """ results = [] for tag in tags: @@ -57,20 +97,10 @@ results.append(tag) return cls.JOINER.join(results) - @staticmethod - def normalize(tag): - """ Normalizes a single tag. Called by :meth:`str2tags` - :param tag: A single tag, as a string. It is assumed that the tag has - no leading/trailing whitespace. - :returns: A normalized version of the tag. - """ - return tag.lower() - - -class DeliciousSerializer(Serializer): - """ Serializer for Delicious-like tags. +class DeliciousTokenizer(Tokenizer): + """ Tokenizer for Delicious-like tags. Delicious tags are separated by spaces, and don't allow spaces in a tag. @@ -80,8 +110,8 @@ TAGS_WITH_SPACES = False -class CommaSerializer(Serializer): - """ Serializer for comma-separated tags. +class CommaTokenizer(Tokenizer): + """ Tokenizer for comma-separated tags. Comma separated tags don't allow commas in a tag. @@ -92,8 +122,8 @@ TAGS_WITH_SPACES = True -class FlickrSerializer(Serializer): - """ Serializer for Flickr-like tags. +class FlickrTokenizer(Tokenizer): + """ Tokenizer for Flickr-like tags. Flickr tags are separated by spaces. If a tag has spaces, it must be enclosed with double quotes. @@ -108,20 +138,10 @@ if not tagstr: return [] if '"' not in tagstr: - return super(FlickrSerializer, cls).str2tags(tagstr) + return super(FlickrTokenizer, cls).str2tags(tagstr) lstr = list(tagstr.strip()) tags, keys, tok, prev, quoted = [], set(), '', '', False - def addtok(tok): - "adds a valid token (tag) to both the tags list and the keys set" - tok = tok.strip() - cleantok = cls.normalize(tok) - if cleantok and cleantok not in keys: - # don't add the tag if it's invalid (empty) or if the - # normalized value is already in the keys set - tags.append((cleantok, tok)) - keys.add(cleantok) - while lstr: char = lstr[0] if char == '"': @@ -131,7 +151,7 @@ (quoted and prev == '"' and '"' not in lstr)): if tok: quoted = False - addtok(tok) + cls._process_tag(tags, keys, tok) tok = '' else: tok += char @@ -139,7 +159,7 @@ del lstr[0] tok = tok.strip() if tok: - addtok(tok) + cls._process_tag(tags, keys, tok) return tags @classmethod diff -r aeba0e85798665ab0e652b110b526a4fd22698c0 -r 415efab93c3f60c7ffead90724bdb0ed613e87bb tests.py --- a/tests.py Sun Jun 06 14:18:04 2010 -0500 +++ b/tests.py Sun Aug 22 14:30:39 2010 -0500 @@ -2,16 +2,47 @@ """Test tagtools.py""" -from tagtools import FlickrSerializer, DeliciousSerializer, CommaSerializer, \ +from tagtools import FlickrTokenizer, DeliciousTokenizer, CommaTokenizer, \ TagWithSeparatorException import unittest -class TestFlickrSerializer(unittest.TestCase): +class TagToolTestCase(unittest.TestCase): + def _test(self, tagstr, expected): + got = self.serializer.str2tags(tagstr) + r = [] + for tag in got: + if tag.is_machinetag: + r.append((tag.clean, tag.raw, tag.namespace, tag.predicate, + tag.value)) + else: + r.append((tag.clean, tag.raw)) + self.assertEqual(len(got), len(expected)) + for tag, exp in zip(got, expected): + if len(exp) == 5: + clean, raw, namespace, predicate, value = exp + is_machine = True + else: + if len(exp) == 2: + clean, raw = exp + else: + clean, raw = None, None + namespace, predicate, value = None, None, None + is_machine = False + self.assertEqual(is_machine, tag.is_machinetag) + self.assertEqual(clean, tag.clean) + self.assertEqual(raw, tag.raw) + self.assertEqual(namespace, tag.namespace) + self.assertEqual(predicate, tag.predicate) + self.assertEqual(value, tag.value) + +class TestFlickrTokenizer(TagToolTestCase): + + def setUp(self): + self.serializer = FlickrTokenizer def test_flickr_str2tags(self): - def test(tagstr, expected): - self.assertEqual(expected, FlickrSerializer.str2tags(tagstr)) + test = self._test # I know these tests look weird, but I actually tried all of them # in flickr. This is how flickr does tagging. @@ -59,10 +90,18 @@ ('2', '2'), ('tag3', 'tag3')]) test('TaG taG GAT tag gat', [('tag', 'TaG'), ('gat', 'GAT')]) test('"TaG" taG GAT "tag" g"a"t', [('tag', 'TaG'), ('gat', 'GAT')]) + test( + 'tag1 tag2:foo tag3:bar=baz tag4:aa="a b c d" " tag5:bb="e f g h', + [('tag1', 'tag1'), ('tag2:foo', 'tag2:foo'), + ('tag3:bar=baz', 'tag3:bar=baz', 'tag3', 'bar', 'baz'), + ('tag4:aa=a b c d', 'tag4:aa=a b c d', 'tag4', 'aa', 'a b c d'), + ('tag5:bb=e', 'tag5:bb=e', 'tag5', 'bb', 'e'), + ('f', 'f'), ('g', 'g'), ('h', 'h')]) + def test_flickr_tags2str(self): def test(tags, expected): - self.assertEqual(expected, FlickrSerializer.tags2str(tags)) + self.assertEqual(expected, FlickrTokenizer.tags2str(tags)) test([], '') test(['t1'], 't1') test(['t1', 't2', 't3'], 't1 t2 t3') @@ -83,12 +122,13 @@ 'tag1 tag number 2 tag3') -class TestDeliciousSerializer(unittest.TestCase): +class TestDeliciousTokenizer(TagToolTestCase): + + def setUp(self): + self.serializer = DeliciousTokenizer def test_delicious_str2tags(self): - def test(tagstr, expected): - self.assertEqual(expected, DeliciousSerializer.str2tags(tagstr)) - + test = self._test test(None, []) test('', []) test(' ', []) @@ -103,19 +143,22 @@ def test_delicious_tags2str(self): def test(tags, expected): - self.assertEqual(expected, DeliciousSerializer.tags2str(tags)) + self.assertEqual(expected, DeliciousTokenizer.tags2str(tags)) test([], '') test(['t1'], 't1') test(['t1', 't2', 't3'], 't1 t2 t3') test(['t1', 't2', 't3'], 't1 t2 t3') self.assertRaises(TagWithSeparatorException, - DeliciousSerializer.tags2str, ['t 1']) + DeliciousTokenizer.tags2str, ['t 1']) -class TestCommaSerializer(unittest.TestCase): +class TestCommaTokenizer(TagToolTestCase): + + def setUp(self): + self.serializer = CommaTokenizer + def test_comma_str2tags(self): - def test(tagstr, expected): - self.assertEqual(expected, CommaSerializer.str2tags(tagstr)) + test = self._test test(None, []) test('', []) test(',', []) @@ -134,13 +177,13 @@ def test_comma_tags2str(self): def test(tags, expected): - self.assertEqual(expected, CommaSerializer.tags2str(tags)) + self.assertEqual(expected, CommaTokenizer.tags2str(tags)) test([], '') test(['t1'], 't1') test(['t1', 't2', 't3'], 't1, t2, t3') test(['t 1', 't 2', 't 3'], 't 1, t 2, t 3') self.assertRaises(TagWithSeparatorException, - CommaSerializer.tags2str, ['t,1']) + CommaTokenizer.tags2str, ['t,1']) if __name__ == "__main__": # pragma: no cover