Text Creation Partnership (Redesigned Website)

The Text Creation Partnership (TCP) at the University of Michigan has recently launched its redesigned website. As its name suggests, TCP fosters collaborative efforts to create “accurately keyboarded and encoded editions of thousands of culturally significant works in all fields of scholarly and artistic endeavor.” That TCP works together with both the international library community and commercial publishers of scholarly electronic resources is one of its defining strengths. It is concerned not only with creating electronic texts in formats that keep pace with technological change but also with promoting access to texts. Its partnership projects with EEBO, ECCO, and Evans illustrate these commitments. Over 25,000 EEBO texts have already been encoded, and these texts will become part of the public domain on January 1, 2015. Aaron McCollough, Text Creation Partnership Project Outreach Librarian, has commented on this forthcoming access to these EEBO-TCP texts and also provided an example of what such access may look like in a recent comment to an earlier emob posting.

Among the features of TCP’s redesigned website that Aaron announced on the SHARP-L listserv, the following should especially interest readers of emob:

* regularly updated TCP “spotlights” on project milestones and related projects in research and scholarly application

* reviews of recently encoded texts

* fun with early modern print

As McCollough noted in his announcement, “we aim for it to be a place of encounter between students and scholars working in Early Modern fields of study, especially those interested in the role of digital archives in those fields.”

One can also follow TCP developments on the TCP News & Views blog. One of the recent announcements here and on the TCP website alerts users to the newly created EEBO Introduction Series. This series provides bibliographical and contextual information, and more, for less well-known early modern texts. Ten editions are now available, but access to them does require a subscription to EEBO.

This entry was posted on December 3, 2009 at 11:57 am and is filed under Digital Archives, ECCO, EEBO, Evans, TCP.

23 Responses to “Text Creation Partnership (Redesigned Website)”

Anna Battigelli Says:
December 4, 2009 at 9:48 am
It is great to hear elsewhere on this blog from both Aaron McCollough (Project Outreach Librarian for the Text Creation Partnership) and Jo-Anne Hogan (ProQuest) that all of EEBO’s TCP texts will be freely accessible (keyboarded text only) by 2020. Though 2020 is a long way away (a substantial portion of a scholarly career), it is heartening to see that free access is a goal, if only for the TCP text. Because the TCP texts allow for full-text searching, access will be extremely helpful in scholarly digital projects.

The new TCP website provides a good overview of the project and is well worth exploring. A look at the TCP projects makes clear that a portion of their funds often goes into helping create these texts, something scholars probably need to consider as they write digital grants. As Jo-Anne Hogan mentions on this blog, free access requires financial support.

EEBO’s website also provides abundant information for how EEBO might be used, how its texts might be searched, and so forth. I did have one question, however. Can one access TCP’s keyboarded text from EEBO’s home page? I found I had to go to the TCP website.

amccollo Says:
December 4, 2009 at 11:50 am
Dear Anna—

To pick up on Jo-Anne’s point, I should say that virtually all of the TCP’s funds go to creating the texts. The only real exception is my salary (but my job is as much about trying to find more resources to make more texts as it is about anything). The TCP’s budget is always tight, mainly because the cost of manual re-keying is high.

Jo-Anne is right to stress the collaborative aspect of what the TCP is doing. In a very real sense, the 150+ partner institutions are bearing the financial burden of this process, and the will to participate depends on a shared vision for the future of digital scholarship. During an economic retreat like the one we’re now enduring, a project like the TCP is really a litmus test for institutional commitment to the future. Every school that decides the future is a good investment gets us closer to our goal. Likewise, every institution that shifts the burden to others impedes our progress.

Regarding your question: TCP keyboarded texts can be accessed from the ProQuest interface as long as your institution is a TCP partner. 25,000 texts have been encoded, and any EEBO search that pulls up a TCP-encoded text should also pull up a small “full text” icon in the record.

Anna Battigelli Says:
December 4, 2009 at 3:06 pm
When you refer to TCP’s “funds,” do you mean subscription fees, or the grant funds dedicated to TCP in grant proposals like the one funding Oxford’s Holinshed Project, or both? Having a more precise appreciation for TCP’s re-keying costs will help scholars have a fuller understanding of the cost of free access.

amccollo Says:
December 4, 2009 at 3:44 pm
The partnership model for the TCP is somewhat unique. All of the revenue comes from the fees charged to Institutions who become partners in the project. This is like a subscription model in some ways, but partner institutions are paying for immediate co-ownership rights over the texts. That means they can locally load the entire corpus whenever they’d like. The fee structure has five tiers (based on size of institution). Big schools pay more than small schools, but every contribution helps. The fees are explained in detail here: http://www.lib.umich.edu/tcp/eebo/partner.html#pricing

There is no real mechanism in place for directing grant funding toward the TCP, although we would welcome the help. With the exception of some assistance from the Delmas foundation (for putting on the 2006 TCP Conference), I don’t think the TCP has received any grant money.

Because the Holinshed project was produced at Oxford (a partner institution), there was no need for Ian Archer et al. to earmark grant money for TCP access in that case. Likewise, any other projects that might come up at Oxford or any other partner schools (listed here: http://www.lib.umich.edu/tcp/eebo/status.html) would have similar access. Once ProQuest begins offering TCP search-access as a subscription service, I suppose grant seekers at non-partner institutions could identify the service as a research cost. In that event, the TCP would still benefit (as ProQuest will give us 20% of the revenue on text-search subscriptions over the next five years).

As to how much keying costs, I can say that the complete project expenditure for keying and reviewing the first 25,000 EEBO texts came to approximately 6.8 million dollars. Annual keyboarding costs got as high as $792,000. This year, we expect something in the neighborhood of $700,000, increasing each year as we move deeper into the production phase of EEBO-TCP II.

When broken down in “per book” terms, the costs look like this:
Each book cost approximately $272 to produce.
For each partner, the per-book cost was something like $2.

Obviously, more partners means cheaper books. Additional revenue streams (like grant money) would have the same effect, but we have not seen grant awarding agencies getting involved thus far.
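
The per-book figures above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the total and the text count are the rounded figures quoted in this comment, the partner count of 150 is an assumption based on the “150+ partner institutions” mentioned earlier in the thread, and the even split ignores the tiered fee structure.

```python
# Rounded figures quoted in the thread; the partner count is an assumption.
TOTAL_COST_USD = 6_800_000   # keying and reviewing the first 25,000 texts
TEXTS_ENCODED = 25_000
PARTNER_COUNT = 150          # assumed from "150+ partner institutions"


def cost_per_book(total_cost: float, texts: int) -> float:
    """Average production cost of one encoded text."""
    return total_cost / texts


def cost_per_book_per_partner(total_cost: float, texts: int, partners: int) -> float:
    """Share of one text's cost borne by each partner, if split evenly."""
    return cost_per_book(total_cost, texts) / partners


per_book = cost_per_book(TOTAL_COST_USD, TEXTS_ENCODED)
per_partner = cost_per_book_per_partner(TOTAL_COST_USD, TEXTS_ENCODED, PARTNER_COUNT)
print(f"per book: ${per_book:.2f}")
print(f"per book, per partner: ${per_partner:.2f}")
```

Under these assumptions the per-partner share comes out a little under $2, and raising `PARTNER_COUNT` lowers it proportionally, which is the “more partners means cheaper books” point.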

amccollo Says:
December 4, 2009 at 3:48 pm
Note: the link to partner schools is broken by the end parenthesis. This is the working link:

http://www.lib.umich.edu/tcp/eebo/status.html

Eleanor Shevlin Says:
December 4, 2009 at 4:31 pm
Evidently (and quite typical of many data-entry projects), TCP uses offshore vendors for the double-keyed conversion process (see TCP production info). Parceling out this segment of text creation to overseas vendors (perhaps India? in the 1980s Ireland did a lot of this work) makes sense in terms of reducing costs, but I imagine the nature of EEBO texts would still make the process fairly costly. The offshore entry costs I dealt with in the 1980s were something like $0.03 a record, but the records were mainly names and addresses. Current vendors list charges as ranging from $0.04 (potentially lower, too) to $16.00 (or higher) a record, depending on the nature and complexity of the record. I suspect that in the case of EEBO texts, “units” or “records” are calculated or defined differently from those used for more typical forms and documents or even modern textbooks.

amccollo Says:
December 4, 2009 at 5:50 pm
We are charged by the kilobyte. Recent bids have hovered around the $0.65 to $0.70 range per 1,000 characters (aka per kilobyte).

- Eleanor Shevlin Says:
  December 4, 2009 at 5:57 pm
  Thanks–such a formula makes sense.
  
Anna Battigelli Says:
December 4, 2009 at 6:25 pm
This is interesting, but we need more information.

I tried to figure out the price per book based on participating institutions.
EEBO-TCP has about 423 participating institutions paying between $12,500 and $50,000 for TCP, depending on their size. Let’s assume that averages to about $25,000/institution. That means that TCP subscriptions generate roughly 10.5 million dollars. That’s in addition to the revenue EEBO has brought in and in addition to grant monies forwarded to TCP to assist re-keying. If we divide the subscription fee total by 25,000 books, we get over $423/item to produce fully searchable, TEI-compliant SGML/XML texts. That seems high.

I’d like to do the math in reverse but need two variables. Aaron, can you provide estimates for the average number of characters per page and for the average number of pages per book? If we have those, we can plug them into the following formula to estimate the total keying cost, and from it the cost per book, in a different way: (.65 x X characters/page x Y pages x 25,000 / 1,000), dividing by 1,000 because the rate is quoted per 1,000 characters.
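
Both estimates in this comment can be sketched in Python. The subscription figures are the ones assumed above (423 institutions at an average of $25,000). For the reverse calculation, the quoted rate applies per 1,000 characters, so the product is divided by 1,000; characters per page and pages per book are unknown at this point in the thread, so they are left as parameters. The function names are hypothetical, chosen only for illustration.

```python
def subscription_cost_per_text(institutions: int, avg_fee: float, texts: int) -> float:
    """Total subscription revenue divided by the number of encoded texts."""
    return institutions * avg_fee / texts


def keying_cost(rate_per_1000_chars: float, chars_per_page: float,
                pages_per_book: float, books: int) -> float:
    """Keying cost only: the rate is charged per 1,000 characters keyed."""
    total_chars = chars_per_page * pages_per_book * books
    return rate_per_1000_chars * total_chars / 1000


# Figures assumed in the comment above.
per_text = subscription_cost_per_text(423, 25_000, 25_000)
print(f"${per_text:.0f} per text")
```

Once X and Y are known, `keying_cost(0.65, X, Y, 25_000)` would give the keying-only total for the 25,000 texts.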

- Eleanor Shevlin Says:
  December 4, 2009 at 6:56 pm
  Anna, many thanks for initiating this look at actual costs–excellent move.
  
  [Also, for some reason, Aaron’s previous posting (dated today at 3:44 pm) did not show up for me until just now–as I was typing this response.]
  
  Let me just add that the estimated cost based on the formula, .65 x X characters/page x Y pages x 25,000 / 1,000 (the rate being per 1,000 characters), will represent the approximate cost of just the keying–and not any additional costs that the vendor might add (“FOB–shipping & handling,” “processing”). Also the keying, while an absolutely major component, is only one step in the process; there’s also the cost of proofing the texts, hosting space, etc.
  
Aaron McCollough Says:
December 4, 2009 at 7:23 pm
Anna—

EEBO-TCP doesn’t have 423 individual partner institutions. For phase 1, we had approximately 150 institutions. Also, as I intended to indicate before, no grant monies have been forwarded to TCP to assist re-keying.

Our throughput during phase I was 4,993,027 kb @ .68.
So, the keying alone cost $3,395,258.

Your formula doesn’t figure in the cost of our reviewing staffs at UM and at Oxford, which have included 3-4 full-time Oxford reviewers, 3-4 full-time UM reviewers, plus part-time reviewers and an Outreach Librarian (Shawn Martin and, for the last 6 months, me). In order to reconstruct the cumulative outlay on salaries, I would need to consult all of the old budgets, which I don’t have here at home, but I can say with confidence that they come to about the same amount as the keying.
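
As a rough check on these figures, the keying total follows directly from the throughput and the rate, and doubling it (on the assumption, stated above, that salaries come to about the same amount) lands near the $6.8 million total mentioned earlier in the thread. A sketch in Python; the 25,000-text count is the figure used throughout this discussion, and the derived per-text averages are only illustrative.

```python
THROUGHPUT_KB = 4_993_027   # phase I throughput, in kilobytes (1,000 characters)
RATE_PER_KB = 0.68          # dollars per kilobyte
TEXTS_ENCODED = 25_000      # texts completed in phase I, per the thread

keying_total = THROUGHPUT_KB * RATE_PER_KB
# Assumption from the comment above: review salaries roughly equal keying costs.
estimated_total = 2 * keying_total
keying_per_text = keying_total / TEXTS_ENCODED
avg_kb_per_text = THROUGHPUT_KB / TEXTS_ENCODED

print(f"keying: ${keying_total:,.0f}")
print(f"keying + review (estimated): ${estimated_total:,.0f}")
print(f"keying per text: ${keying_per_text:.2f}")
print(f"average size: {avg_kb_per_text:.0f} kB per text")
```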

- Anna Battigelli Says:
  December 4, 2009 at 10:59 pm
  Hi Aaron,
  
  Going to the list of “participating institutions” that you provided as a link, we find three categories: 1) Phase I & II Partners; 2) Phase I Partners; and 3) Non-Partners. In my earlier calculations, I failed to see that all of the institutions listed under “Phase I” were also listed (perhaps obviously) under “Phase I & II”. But deleting those redundancies removes only 64 from the 423 I counted, leaving 359. I suppose I should also have deleted the 184 “Non Partners” because they subscribe to EEBO but are not partners for TCP. Is that correct? That brings us to 175. Is that an accurate count of current partners? If so, that brings TCP subscriptions to something like $4.4 million (assuming $25,000 as an average subscription cost, which is, I think, low). This indeed leaves TCP in the red, or at least under the cost of 25,000 books at $272 ($6.8 million). If this is anything close to accurate, it’s sobering. And we have not factored in any other costs.
  
  I hope you will correct any errors in my admittedly rough estimates. It’s very helpful to have you help us think this through and important for us to understand the financial context of how something like the TCP works. Again, thanks for your help.
  AB
  
Anna Battigelli Says:
December 4, 2009 at 7:43 pm
Yes, Eleanor, you are exactly right: I was using absolutes, as if all the money from subscriptions would be directed into producing TCP texts. I did this in part because I wanted an overall sense of the subscription fees and then the book fees. Obviously the actual budget would have to include additional costs: paper, ink, mailing, salaries, and so forth. But according to Aaron, this is not a high percentage of the overall budget.

In Aaron’s earlier note, he gives $272 as the price of producing a book (by which I assume he means a TCP text). That is lower than my number above but still much higher than I would have guessed before I started looking at the numbers.

Any additional information from Aaron or Jo-Anne about the variables I wanted or about other aspects of text production costs will help provide a fuller understanding of how much it costs to produce something that will, eventually, be freely available.

Eleanor Shevlin Says:
December 4, 2009 at 8:29 pm
[Once again, Aaron’s response was not visible when I posted the following comments.]

Again, many thanks, Anna…

I was actually thinking of additional, immediate costs of producing each TCP text/book. The production segment of the TCP website notes the following steps beyond the keying:

When the keyboarding and SGML/XML tagging is completed by the vendor, the file is transferred back to Michigan or Oxford for proofing and SGML tag review, to ensure that the work conforms to the established standard of 99.995% character accuracy. Works that are accepted are then matched to catalog records to create bibliographic headers and added to the existing online collection. Currently, between two and three hundred titles per month are added to the text file.

Thus, the production cost would also include expenses incurred for proofing and catalog-record matching.

Eleanor Shevlin Says:
December 4, 2009 at 11:59 pm
An awareness of the actual costs involved in producing quality encoded texts is important, so I’m glad our discussion has taken this turn. Such attention underscores the labor, time, and expense required to create editions such as those produced under TCP auspices and drives home the importance of encouraging our institutions and scholarly societies to support such collaborative efforts. This look at financial specifics for the TCP-EEBO texts also provides some perspective for considering other projects we’ve discussed on emob, from Google Books (its deep pockets, the sheer number of texts it has already digitized, and its multiple problems with metadata and bibliographic errors) to the more slowly progressing Internet Archive and Ben’s Eighteenth-Century Book Tracker. The realization of the price tags involved also stresses the challenges involved in finding solutions to the “digital divide” developing between scholars and in developing a means of achieving greater access for more scholars and their students.

Anna Battigelli Says:
December 5, 2009 at 9:02 am
Yes, these are all things we need to consider in order to appreciate the cost of a project like TCP.

It would also help us assess the “value” of a project like TCP if we had a clearer sense of the longevity of digital surrogates and the cost of that longevity. Microfilm got scratched and failed to hold up in the way that was expected. Now that it has been digitized, the machines used to read microfilm are more difficult to find, at least in small libraries. And when technologies migrate, as microfilm migrated to digitized texts, librarians sometimes discard the older technologies (microfilm collections, paper newspaper collections). Preservation is or ought to be a common goal and a key aspect of assessing “value.”

Eleanor Shevlin Says:
December 5, 2009 at 2:14 pm
Many digitization projects invoke preservation in arguing for their value. Not surprisingly, a commitment to preservation in the digital world is one of the tenets of TCP’s mission:
“Ensure that this content will migrate forward through shifts in technology to represent editions of enduring value to libraries.” The wording here usefully specifies TCP’s view of and plans for preservation: creating encoded texts that will be adaptable to technological changes and advances.

Anna Battigelli Says:
December 6, 2009 at 11:26 pm
Turning to a different aspect of EEBO-TCP, I am impressed with the TCP’s transcriptions. By adding a full-text search function, TCP allows the kind of searching ECCO delivers, at least for the 25,000 texts within EEBO that have been re-keyed. Additionally, EEBO’s full-text searching may be even more accurate than ECCO’s, since it does not rely on OCR software.

I am curious about the selection criteria and why only texts that appear in the New Cambridge Bibliography of English Literature are candidates for re-keying.

Eleanor Shevlin Says:
December 7, 2009 at 12:01 am
I may be wrong, but it seems that TCP’s work with EEBO is further along than its work with ECCO (the website mentions 10,000 ECCO texts are being prepared, while over 25,000 have been completed for EEBO).

Aaron may be able to tell us more about the selection process, but I would surmise that the taskforce convened to make recommendations perhaps considered which texts might have the most demand.

The TCP website does make an important point that partner institutions can recommend texts for encoding:

if your institution is a TCP partner, and you would like to see us encode something specific, we encourage you to let us know, and we will do our best to accommodate you.

You can either email the author(s) and title(s) to me at tcp-info at umich dot edu, or you can fill out our html form at: http://www.lib.umich.edu/tcp/eebo/forms.html.

amccollo Says:
December 7, 2009 at 11:25 am
Although we cannot know for sure what Gale’s OCR texts look like (because they hide it), I can say that keyed text is dramatically superior to so-called “dirty OCR.” ECCO-TCP was started to address this problem (among others… like the perpetual commodity status of the digital page images), but the existence of OCR’d versions of the ECCO texts significantly undermined the value of the TCP editions in the minds of many collections officers. After all, the primary resource is very expensive. It will always be hard for libraries to justify paying for the image product PLUS an encoded text version. ECCO-TCP has more or less stalled for lack of funding.

As for EEBO-TCP, any text that is in the ProQuest EEBO collection is fair game for conversion. Eleanor is correct that our initial selection criteria were based on expected demand. In our second phase, however, we are aiming for complete conversion of the unique monographs in the collection.

Anna Battigelli Says:
December 7, 2009 at 12:26 pm
Thanks, Aaron. Both those points are helpful.

18thConnect is working on developing its own OCR technology, which its website claims will be 99% reliable. I would like to know how accurate that is. If there are approximately 100 characters per sentence, wouldn’t that mean that one character per sentence would still be inaccurate? By contrast, EEBO-TCP texts are, according to the TCP website, 99.995% accurate. Perhaps Laura Mandell can tell us more about 18thConnect’s OCR accuracy and its implications.

The good news, as Laura announced on this site in August, is that Gale will let 18thConnect allow non-subscribers to ECCO search 18thConnect’s machine-read texts. Though non-subscribers will not be able to pull up either ECCO’s digital copy of the text or 18thConnect’s machine-read copy, they will have access to the bibliographical records of their hits. This is a move in the right direction, though it does not do away with the digital divide.

amccollo Says:
December 7, 2009 at 12:41 pm
TCP has been working with Laura and 18thConnect. You are right that 99% accuracy is still 1 bad character per 100. The kinds of errors that OCR produces have an interesting range, too, as discussed in recent D-LIB articles (esp. http://www.dlib.org/dlib/july09/munoz/07munoz.html ).

In our analysis, even 99.993% accuracy would prove almost unmanageable from a review/xml-encoding perspective. The increase in errors would likely slow the process down so much that OCR would be no more cost-effective than sourcing the texts through re-keying. We’re still hopeful, of course, but we will have to wait to see what the results end up looking like. My point, I guess, is that (as long as we’re talking about usable digital texts) review will still be an important part of the equation, no matter what, and we haven’t yet figured out how to make it viable with anything less than 99.995% accurate capture rates.
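
The gap between these accuracy rates is easier to see as expected errors per stretch of text: 99% means one bad character in 100, while 99.995% means one in 20,000. A minimal Python sketch, assuming errors are spread uniformly; the 100,000-character text length is a hypothetical value chosen only for illustration.

```python
def expected_errors(accuracy_percent: float, characters: int) -> float:
    """Expected number of wrong characters at a given character accuracy."""
    return (100.0 - accuracy_percent) / 100.0 * characters


def chars_per_error(accuracy_percent: float) -> float:
    """Average number of characters between errors."""
    return 100.0 / (100.0 - accuracy_percent)


TEXT_LENGTH = 100_000  # hypothetical text length in characters

for accuracy in (99.0, 99.993, 99.995):
    errors = expected_errors(accuracy, TEXT_LENGTH)
    print(f"{accuracy}% accurate: about {errors:.0f} errors in "
          f"{TEXT_LENGTH:,} characters, one every "
          f"{chars_per_error(accuracy):,.0f} characters")
```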

Eleanor Shevlin Says:
December 8, 2009 at 7:43 pm
Many thanks for your insights about TCP’s efforts. We have Tanner’s D-LIB article on our bibliography, and we have found it quite helpful in illuminating OCR issues.

I do understand about the stalled funding for ECCO. I do think our discussion of costs lends a better understanding of the complex decisions involved in evaluating where to spend an always limited amount of funds.


Early Modern Online Bibliography