wiki:Internal/InventoryV3

Version 40 (modified by ssugrim, 14 years ago)

James is working on a second generation inventory script.

Currently on: Restructuring writer and correcting table assumptions

Versions:

  • gather.rb:2.17
  • writer.rb:1.02 (deprecated)
  • logcopy:0.04 (deprecated)
  • inventoryv2.rb:0.05

Its plan is to be simpler and less ambitious than its predecessor, but still respect the sql table structure ("as much as possible").

There are 4 parts to this process:

  1. inventoryv2.rb: execs scripts on nodes using the orbit framework
  2. gatherer.rb: collects information using only operating system facilities (dmesg, lsusb, lspci, ifconfig, /sys, lshw).
  3. writer.rb: checks the mysql repository for changes from the current state. If anything differs, it updates the repository.
  4. logcopy.rb: checks the log file for errors; if any are present, copies the logs to a specified location

The sql structure is a bit of a mess; the major tables of interest are:

  1. motherboards - list of things that can be connected to; has its own id used to tie other tables to it
  2. devices - list of devices "connected" to motherboards
  3. device_kinds - type identifier for connected devices (an attribute of a device).
  4. locations - converts x,y,testbed_id coordinates to a single integer
  5. nodes - maps motherboards to locations; also has an id for said mapping
  6. inventories - records the start and stop time of the inventory pass.
  7. testbeds - gives a testbed id for the specific domain, thus disambiguating node1-1 (a name that can occur in more than one testbed)

A lot of the tables are full of unused columns. I guess we'll just ignore them for now. The basic crux of an update should be the following:

  1. examine our IP (and hostname) to determine our current location
  2. gather information about the motherboard:
  3. Gatherer:
    1. Disk Size (dmesg)
    2. Memory Size (dmesg)
    3. CPU count (dmesg)
    4. motherboard serial number (lshw)
    5. Gather information about attached devices:
      1. 2 wired Ethernet addresses (ifconfig, /sys)
      2. 2 wireless Ethernet addresses (ifconfig, /sys)
      3. any usb devices (lsusb, /sys)
    6. export to xml
  4. Writer:
    1. import xml output from gatherer
    2. collect identifiers from mysql based on the gathered information (domain ⇒ testbed_id; x,y,testbed_id ⇒ loc_id; mb_ser ⇒ mb_id; loc_id ⇒ node_id)
    3. update motherboard information if different, and stamp it with the current inventory number
    4. add kinds if they don't exist already
    5. update devices if different and stamp them with the inventory number
    6. update the node's mb_id if the (loc, mb) pair doesn't match
  5. profit.
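The identifier chain in writer step 2 can be sketched as follows. This is a minimal illustration only: the hashes stand in for the testbeds, locations, motherboards and nodes tables, and the sample ids and the `resolve_ids` helper are hypothetical (the real writer gets these values through mysql queries).

```ruby
# Hypothetical in-memory stand-ins for the mysql tables.
TESTBEDS     = { "sb7.orbit-lab.org" => 7 }   # domain => testbed_id
LOCATIONS    = { [1, 1, 7] => 42 }            # [x, y, testbed_id] => loc_id
MOTHERBOARDS = { "ABC123" => 5 }              # mb_ser => mb_id
NODES        = { 42 => 17 }                   # loc_id => node_id

# Follow domain => testbed_id => loc_id => node_id, plus mb_ser => mb_id.
# fetch raises KeyError on a broken link: an unidentifiable node is fatal.
def resolve_ids(domain, x, y, mb_ser)
  testbed_id = TESTBEDS.fetch(domain)
  loc_id     = LOCATIONS.fetch([x, y, testbed_id])
  node_id    = NODES.fetch(loc_id)
  mb_id      = MOTHERBOARDS[mb_ser]   # nil means "new motherboard, insert it"
  { testbed_id: testbed_id, loc_id: loc_id, node_id: node_id, mb_id: mb_id }
end

ids = resolve_ids("sb7.orbit-lab.org", 1, 1, "ABC123")
```

A missing motherboard serial is deliberately not an error here, since the plan is to insert unknown motherboards rather than abort.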

Required Tools / Libraries

  1. lsusb (usbutils.deb)
  2. lspci (native)
  3. dmesg (native)
  4. ifconfig (native)
  5. libxml-simple-ruby.deb
  6. libmysql-ruby.deb
  7. lshw (lshw.deb)
  8. logger (ruby standard)
  9. ftools (ruby standard)

Gatherer: The disk size and memory size come from a quick scan of dmesg. The disk size matches, but the memory size is a little off, probably because dmesg reports memory differently from /sys. It would be nice to find the /sys entry for consistency.

In /sys/devices/pci0000:00 are the subdirectories corresponding to the specific Ethernet hardware. Each directory that corresponds to an Ethernet device contains a symbolic link named after the operating system name of the device. This lets us match the pci address (the name of the subdirectory of /sys/devices/pci0000:00) to the mac address (from ifconfig). lspci can tell us the associated pci address and a hardware identifier string.
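A minimal sketch of the same match done from the interface side, assuming the interface is a PCI device: /sys/class/net/<iface>/device is a symlink into the /sys/devices/pci... tree, and the last component of its resolved path is the PCI address. The helper names and the `sysfs_root` parameter (there only so it can be tested outside /sys) are hypothetical.

```ruby
# Resolve the PCI address of a network interface via sysfs.
# For PCI NICs the resolved target ends in e.g. ".../0000:01:03.0".
def pci_address_for(iface, sysfs_root = "/sys")
  link = File.join(sysfs_root, "class", "net", iface, "device")
  File.basename(File.realpath(link))
end

# lspci prints addresses without the leading PCI domain ("01:03.0"),
# so strip the "0000:" prefix before comparing with lspci output.
def lspci_form(address)
  address.sub(/\A\h{4}:/, "")
end
```

On a real node `lspci_form(pci_address_for("eth0"))` would give the bus_add value the gatherer records, with the mac still coming from ifconfig.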

lsusb, on the other hand, offers a direct correlation to the device_kinds table: the ordered pair of numbers xxxx:yyyy corresponds directly to the table's vendor and device ids, and the Bus xxx Device yyy numbers fit into the address field of the devices table.
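Parsing that lsusb output can be sketched with one regexp per line. This assumes the usual `Bus 001 Device 009: ID 0403:6001 description` line format; the hash keys mirror the attributes the gatherer writes to xml, and `parse_lsusb` is a hypothetical name.

```ruby
# One lsusb line: "Bus 001 Device 009: ID 0403:6001 Some description"
LSUSB_LINE = /\ABus (\d{3}) Device (\d{3}): ID (\h{4}):(\h{4})\s*(.*)\z/

# Turn lsusb output into an array of hashes (one per device);
# lines that don't match the format are skipped.
def parse_lsusb(output)
  output.each_line.map do |line|
    m = LSUSB_LINE.match(line.chomp)
    next unless m
    { bus_add: "#{m[1]}:#{m[2]}", vendor: m[3], device: m[4], str: m[5] }
  end.compact
end
```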

9/29/09

I may have discovered the cause of the device / vendor discrepancy. Joe seems to be looking at /sys/class/net/devicename/device… perhaps this points to a different device id. I'll have to check it out.

That being said, I have a working Gatherer prototype:

ssugrim@external2:~/scripts$ ruby gatherer.rb
ssugrim@external2:~/scripts$ more /tmp/external2.xml
<external2>
 <ip_adds>
  <10.50.0.12 iface='eth1' host='external2.orbit-lab.org'/>
  <127.0.0.1 iface='' host=''/>
 </ip_adds>
 <motherboard mem_size='1048512' disk_size='156301488' cpu_num='4'/>
 <Devices>
  <pci>
   <eth0 device='1229' bus_add='01:03.0' mac='00:e0:81:26:70:16' str='Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)' vendor='8086'/>
   <eth1 device='1010' bus_add='04:01.0' mac='00:e0:81:26:76:9c' str='Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)' vendor='8086'/>
   <eth2 device='1010' bus_add='04:01.1' mac='00:e0:81:26:76:9d' str='Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)' vendor='8086'/>
  </pci>
  <usb>
   <0 device='0001' bus_add='001:001' str='Linux Foundation 1.1 root hub' vendor='1d6b'/>
  </usb>
 </Devices>
</external2>
ssugrim@external2:~/scripts$

10/2/09

Minus error checking for failed commands, the gatherer is complete. I'm now moving on to the writer. I'm going to keep them in the same script for now, so I don't have to deal with re-importing the data and extracting it from xml. Splitting them apart again is a todo, so that we can call just the gatherer if we want to.

For now, I need to determine which node I am based on the resolved host name. The scheme is nodex-y.testbedname#, so I can extract the x and y coordinates from the node part, and the testbed name will have to be a lookup. (This should probably be in the gatherer as parameters.)
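A minimal sketch of that hostname split, assuming names of the form node1-1.sb7.orbit-lab.org (as in the xml sample below). The domain part is kept whole so it can be used for the testbed lookup; the `parse_node_host` name is hypothetical, and in practice the name would come from a reverse lookup of the node's IP.

```ruby
# Hostname scheme: node<x>-<y>.<testbed domain>
NODE_HOST = /\Anode(\d+)-(\d+)\.(.+)\z/

# Split a node FQDN into grid coordinates and the testbed domain.
def parse_node_host(fqdn)
  m = NODE_HOST.match(fqdn)
  raise ArgumentError, "parse_node_host - unexpected hostname #{fqdn.inspect}" unless m
  { x: m[1].to_i, y: m[2].to_i, domain: m[3] }
end
```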

Once I have that, I can look up my unique id in the mysql database. This id will then allow me to correlate devices with the ones I have.


Following the instructions on http://support.tigertech.net/mysql-duplicate

I copied the mysql database from inventory1 to inventory2.

One caveat is noted on http://forums.digitalpoint.com/showthread.php?t=259486

In the top of the database file you are trying to dump you will see that :
CREATE DATABASE `gunit_pimpjojo` DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
Just remove this from the dump ( Notepad or wherever you have the dump)
Then re paste the file
You just need to remove that line....and you will be good to go
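Removing that line can also be scripted instead of done by hand. This is a minimal sketch, assuming the dump is plain SQL text: it drops `CREATE DATABASE` lines, and also `USE` lines, which dumps made with `mysqldump --databases` contain and which would send the load back into the original database. The function name is hypothetical.

```ruby
# Drop the statements that tie a dump to its original database name,
# so it can be loaded into a differently named one (e.g. inventory2).
def strip_create_database(dump_text)
  dump_text.each_line.reject { |line| line =~ /\A\s*(CREATE DATABASE|USE)\b/ }.join
end
```

The stripped text can then be written to a file and loaded with the mysql client into the target database.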

10/5/09

I've revamped the internal data types to fit the way xml-simple outputs and re-imports data. Any multi-argument results (eth, usb, ip) return an array of hashes, which creates clean xml. I also unfolded the coordinates hash into single instance variables (x, y, domain); each one ends up as an attribute of the root element.

The new xml format looks like so.

<opt x="1" y="1" disk_size="156301488" domain="sb7" cpu_num="1" mem_size="491456">
  <pci device="0013" name="ath0" bus_add="00:09.0" mac="00:60:b3:ac:2b:92" str="Ethernet controller: Atheros Communications, Inc. AR5212/AR5213 Multiprotocol MAC/baseband processor (rev 01)" vendor ="168c" />
  <pci device="0013" name="ath1" bus_add="00:0a.0" mac="00:60:b3:ac:2b:66" str="Ethernet controller: Atheros Communications, Inc. AR5212/AR5213 Multiprotocol MAC/baseband processor (rev 01)" vendor="168c" />
  <pci device="4320" name="eth0" bus_add="00:0b.0" mac="00:0f:ea:4a:8b:56" str="Ethernet controller: Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller (rev 13)" vendor="11ab" />
  <pci device="4320" name="eth1" bus_add="00:0c.0" mac="00:0f:ea:4a:8b:57" str="Ethernet controller: Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller (rev 13)" vendor="11ab" />
  <usb device="6001" name="usb" bus_add="001:009" str="Future Technology Devices International, Ltd FT232 USB-Serial (UART) IC" vendor="0403" />
  <ip ip="10.17.1.1" host="node1-1.sb7.orbit-lab.org" iface="eth1" />
  <ip ip="127.0.0.1" host="" iface="" />
</opt>

I've also gone back to the original two-script model. Gatherer is "feature complete".


Working on the writer, I've created an internal data type called Xmldata; it has exactly the same fields as Info, but populates them from the generated xml file.

Working on the mysql part, I have to examine the code that lives in

ssugrim@internal1:/opt/gridservices2-2.2.0/lib/ogs_inventory$

NOTE: mysql query strings should be built (and inspected) before the query is actually executed, since they don't always form the way you think they do. Also, the %&string& formulation of strings is very helpful in getting the quotes correct.
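A small sketch of both habits. `%&...&` is Ruby's general delimited string: it interpolates like a double-quoted string, but single and double quotes can appear inside without escaping. The column names and the `location_query` helper are assumptions for illustration, and values are interpolated unescaped, which is only reasonable for trusted, locally gathered data.

```ruby
# Build the query as a string first so it can be printed or logged
# before it is handed to the database.
def location_query(x, y, testbed_id)
  %&SELECT id FROM locations WHERE x = '#{x}' AND y = '#{y}' AND testbed_id = '#{testbed_id}'&
end

q = location_query(1, 1, 7)
puts q   # inspect the exact SQL that will be executed
```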

10/08/09

The writer script is now equipped with two classes, Xmldata and Identify. Both can only be instantiated through the create method, which makes them singletons (create only calls new if an instance does not already exist). Identify instantiates an Xmldata object, and then uses the x and y coordinates and the domain to determine a location id (the global unique identifier that ties all the tables together). I also get the max id from the inventory ids, assuming that the highest number is the latest.
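The create-only pattern can be sketched like this. The class body is a hypothetical stand-in named after Identify, not the writer's actual code; Ruby's standard `singleton` library (`include Singleton`, then `.instance`) is an alternative, though it does not pass constructor arguments.

```ruby
class Identify
  private_class_method :new   # outside code must go through create

  # Return the single shared instance, building it on the first call.
  def self.create(*args)
    @instance ||= new(*args)
  end

  attr_reader :domain

  def initialize(domain)
    @domain = domain
  end
end
```

Later calls to `create` ignore their arguments and return the first instance.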

10/12/09

Quick edit to gatherer to convert the device and vendor tags to decimal instead of hex. The reason they didn't match before was that in the sql database they are stored as decimal numbers (presumably integer columns; hex is just a way of writing the number).
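The conversion itself is a one-liner each way. This sketch uses `Integer(str, 16)` rather than `String#hex` because `Integer` raises on malformed input, whereas `"zz".hex` silently returns 0; the helper names are hypothetical.

```ruby
# lspci/lsusb ids are 4-digit hex strings; the tables hold integers.
def hex_id_to_int(hex)
  Integer(hex, 16)
end

# Back to the 4-digit hex form, e.g. for log messages.
def int_id_to_hex(id)
  format("%04x", id)
end
```

For example, the Intel vendor id 8086 becomes 32902 and the FTDI id 0403 becomes 1027.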

10/18/09

Writer is "feature complete". The main (non-data) class is Check_sql. Besides new, its main methods are check and update. They respectively compare the xmldata against sql and update the database if the data doesn't match. I'd like to be more "independent" of the form of the xmldata, but that would involve a lot more dummy variables and searching of hashes.

Big TODO is mostly rescuing errors. First on the list is connect retries. Class interface descriptions to follow soon.

10/20/09

Modified both gatherer and writer to take parameters. The parameters are as follows:

Writer:
 --server = #server hostname (default: internal1.orbit-lab.org)
 --user = #username for mysql
 --pass = #password
 --db = #database name (default: inventory2)
 --input = #input file name (default: foo.xml)

Gatherer: 
 --output = #name of outputfile (default: stdout)

Also now writer only checks vendor and device id. If no match is found it will add it with the description string.


10/26/09

Modifying gatherer to use lshw to collect the uuid (motherboard serial number), and changing the internal data types to more closely match the table contents, e.g. devices and motherboards.

Originally I thought to just use lshw to gather most of the information, but this doesn't really gain us anything, since I would have to resort to the other tools (lsusb and lspci) to find the relevant entries in the lshw output (and it would require a huge rewrite of the gatherer). Lshw can output in a few different ways. I'm currently using the direct line-by-line approach to search for the uuid. I did, however, experiment with the -xml output. When imported with XmlSimple.xml_in(), we get a massive hash/array structure that contains all the data elements as either a value in a hash or an item in an array. To find what we're looking for we need a recursive tool to extract the relevant data structures. An example was written in lshw_recursion_example.rb; the main recursive tool keeps calling the each method of every sub-element (hashes and arrays both have an each method, but they behave differently, so a check for the class is needed first).

One snag that was "hacked" in: if we find the keyword in an array and push the containing data type, all we get is an array with that keyword. I resolved this by passing the previous data structure as a parameter. If the keyword is found in a hash I store the hash; if it's found in an array, I store the previous hash. I opted to hunt for a list of words instead of a single one. It's more efficient than iterating across the entire structure once for each word. We don't iterate through the words for every element, just the ones that terminate a branch (the strings). This saves a lot of computation and prevents a few type errors. It's assumed that the word list is significantly smaller than the size of the hash. Some example code:

found = Hash.new
def hunt(dat, words, found, prev = nil)
  # check the type of the current node
  if dat.kind_of?(Array)
    dat.each do |v|
      # a String in an array ends a branch: record the container of this
      # array (prev), since storing the array itself gives back only the word
      words.each { |w| found.store(w, prev) if /#{w}/.match(v) } if v.kind_of?(String)

      # recursively call the function on the children of this data structure;
      # the current structure is passed as the parent of the child
      hunt(v, words, found, dat)
    end
  elsif dat.kind_of?(Hash)
    dat.each do |k, v|
      # same deal as the array, except we have a key/value combo and can store
      # the current hash directly. We still pass the parent as a parameter,
      # since we don't know what type the child is
      words.each { |w| found.store(w, dat) if /#{w}/.match(v) } if v.kind_of?(String)
      hunt(v, words, found, dat)
    end
  end
end

11/4/09

I'll need to revisit the use of recursion for lshw. I have some working ideas on how to do it. Ivan suggested multi-tier iterations where I hunt for keywords following some kind of "path of keywords". By calling "hunt" repeatedly with a sequence of keywords (examining keys as well as values), we should be able to iteratively extract smaller and smaller data structures that contain more relevant information.
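One way the "path of keywords" idea could look, as a sketch only: `first_hash_matching` does a depth-first search for the first hash whose key or string value matches a word, and `hunt_path` repeats that search inside each previous result. Both names and the sample structure (shaped like an XmlSimple import of lshw -xml) are hypothetical.

```ruby
# Depth-first search for the first Hash with a key or String value
# matching word; returns nil if nothing matches.
def first_hash_matching(dat, word)
  case dat
  when Hash
    return dat if dat.any? { |k, v| k.to_s.match?(word) || (v.is_a?(String) && v.match?(word)) }
    dat.each_value do |v|
      hit = first_hash_matching(v, word)
      return hit if hit
    end
  when Array
    dat.each do |v|
      hit = first_hash_matching(v, word)
      return hit if hit
    end
  end
  nil
end

# Each keyword narrows the search to the structure found by the previous one.
def hunt_path(dat, path)
  path.reduce(dat) { |current, word| current && first_hash_matching(current, word) }
end

data = { "node" => [{ "id" => "core", "node" => [
  { "class" => "network", "logicalname" => ["eth0"], "serial" => "00:0f:ea:4a:8b:56" },
  { "class" => "memory", "size" => "512" }] }] }
```

With this data, `hunt_path(data, ["core", "network"])["serial"]` yields the eth0 mac, while a path whose keyword never matches returns nil.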

More immediate are the changes that need to be made to writer to reflect the table structure in mysql. They are:

  • Need to get the motherboard id from the table by matching against the serial number
  • Update node to correlate motherboard to location (when they move)
  • motherboard updates should only modify disk and memory (the motherboard id should not change)
  • If a motherboard is not found then we insert it.
  • should get the node id from the sql table by matching against location

11/17/09

Modifications on writer have been completed (preliminary checks worked).

  • reverted Db.hupdate to only update. The calling functions should decide whether to insert or update.
  • Mb_id now checks against serial instead of location in the Identify class
  • update_mb now checks for mb_id. If the ID is present it will update the record; otherwise it will insert WITHOUT specifying an ID, since SQL should auto-increment the ids
  • Nodes are uniquely identified by a triple of (node_id, location_id, motherboard_id). It's assumed that the (node_id, location_id) portion is constant. Thus the only change/update we should check for and perform is to ensure that the motherboard id matches for a given (node_id, location_id) pair. The update_node function only modifies motherboard_ids.
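The update_node rule can be sketched against an in-memory row. The Struct fields and the inventory stamp are assumptions for illustration (the stamp reflects the later fix where node updates also record the inventory); this is not the writer's actual function.

```ruby
# Hypothetical node row; (node_id, location_id) is treated as fixed.
Node = Struct.new(:node_id, :location_id, :motherboard_id, :inventory_id)

# Change only motherboard_id (plus the inventory stamp), and only when
# it differs. Returns true if the row was modified.
def update_node(node, mb_id, inventory_id)
  return false if node.motherboard_id == mb_id
  node.motherboard_id = mb_id
  node.inventory_id = inventory_id
  true
end
```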

Things that need to be done:

  • move all the "checks" into the get methods (deprecate the get methods). check() should simply call the sub_check methods and retain a hash of matches for specific data structures (table entries).
  • update can then iterate over that hash and call the respective update functions for each given table.
  • to that end the update_device method needs to be split in two to reflect the data structures
  • the data structure design paradigm should be to have one data structure for each table that needs to be checked / update. It's pretty close to this, but needs a little work.

11/23/09

The previous was completed. There were two bugs that needed tweaking:

  • Update node did not update the inventory when it updated a node's info.
  • Had to add a hack to prevent unknown dev_ids from getting double-entered in update_adds when the id is unknown. If the node has multiple instances of an unknown piece of hardware (like 2 new Ethernet cards), the current routine would double-add them.
    • this hack should be revisited for efficiency: currently it double-checks for a kind (in case one was added after the adds_array was populated). This is very wasteful, as missing kinds should be a rare event. I should probably switch to a different function or something if I've entered the rare "never seen it before" scenario.
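One cheaper alternative to re-checking the table is to remember, within a single pass, which kinds have already been queued. A minimal sketch, assuming devices are hashes with integer :vendor and :device ids and that the known kinds were loaded once up front; `missing_kinds` is a hypothetical name.

```ruby
require 'set'

# Return each unknown [vendor, device] kind exactly once, even when
# several identical unknown cards are installed in the node.
def missing_kinds(devices, known_kinds)
  seen = Set.new(known_kinds)
  devices.each_with_object([]) do |dev, to_add|
    key = [dev[:vendor], dev[:device]]
    next if seen.include?(key)
    seen << key
    to_add << dev
  end
end
```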

11/24/09

Fixed a few bugs from previous edits:

  • I added a kind check/update that precedes the device update so that kinds are always populated before devices
  • each update now calls check (except for devices as they're the last). Update mb also repopulates the mb_id.
  • the mb_id information was moved from Identify to check_sql since it's dynamic and properly belongs there. Identify no longer has a mb_id method.

5/19/2010

Adding Error handling and logging to the writer script:

  • the require 'logger' + code to actually log Info and errors is now in place
  • For the Db.connect method I've added a begin/rescue/end block. It sleeps for 60 + rand(60) seconds, then tries to connect again, but I'll have to fine-tune its behavior to only reconnect when it can't reach the server, instead of reconnecting on every error. It also logs how long it's going to wait until it tries again.
  • Jack is stripping things for gatherer but we'll eventually merge that code.
  • there are 5 places where I need to try and catch exceptions; most of the other exceptions I want to go unhandled so it terminates. I'll need to put in a bunch of debug logging to make sure my data is what I think it is.
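The connect retry can be sketched as a generic wrapper. This is an assumption-laden sketch, not the writer's Db.connect: the wait is `base + rand(jitter)` seconds, `on:` limits retries to one error class (e.g. the mysql binding's connection error) so other errors still terminate, and `sleeper:` is injectable only so the sketch can be tested without sleeping.

```ruby
require 'logger'

# Run the block; on the given error class, log the wait, sleep and retry.
# After the last attempt the exception is re-raised.
def with_retries(tries:, base:, jitter:, logger:, on: StandardError, sleeper: method(:sleep))
  attempt = 0
  begin
    attempt += 1
    yield
  rescue on => e
    raise if attempt >= tries
    wait = base + (jitter > 0 ? rand(jitter) : 0)
    logger.info("Db.connect - #{e.message}; retry #{attempt} of #{tries - 1} in #{wait}s")
    sleeper.call(wait)
    retry
  end
end
```

With the numbers above this would be `with_retries(tries: n, base: 60, jitter: 60, logger: log) { connect }`, where `connect` stands for whatever opens the database connection.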

5/22/2010

I've put a bunch of logging and error handling code into writer version 0.99. It should now log appropriately. To generate a bunch of meaningful results I've taken these steps:

  • Writer now logs to /tmp/writer.log
  • Wrote a new script called logcopy.rb
    1. It checks the log for errors
    2. if errors are found, I mount the tmp directory on repo1, using
      repository1:/export/orbit/image/tmp /mnt as my mount
    3. In the tmp directory is a new directory called logs
    4. the /tmp/writer.log file is copied to this logs directory and stamped with the name of the node it came from
  • I've created a new image called inventoryV2.ndz which has all the updated scripts (writer and logcopy)
  • I've modified the inventoryV2.rb script to call logcopy as the final step. it's now named inventoryV2-1.rb
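The check-and-copy steps of logcopy can be sketched as two helpers. This assumes the log is written by Ruby's Logger with its default formatter, where ERROR and FATAL lines start with "E," and "F,"; the mount step is left out and `dest_dir` simply stands for the mounted logs directory. Both function names are hypothetical.

```ruby
require 'fileutils'

# True if the log contains any ERROR or FATAL entry (default Logger format).
def log_has_errors?(path)
  File.foreach(path).any? { |line| line.start_with?("E,", "F,") }
end

# Copy the log into dest_dir, prefixing the file name with the node name.
def copy_log(path, dest_dir, node_name)
  FileUtils.mkdir_p(dest_dir)
  dest = File.join(dest_dir, "#{node_name}-#{File.basename(path)}")
  FileUtils.cp(path, dest)
  dest
end
```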

6/24/2010

I forgot to log a bunch of changes:

  • found a bug caused by reducing the wait time for the experiment to complete. Since the connect retry waited rand(60)+60 seconds, my log copies would copy over incomplete files, because logcopy waited only 30 seconds before copying. I've redone the numbers: I now wait only rand(20) seconds before attempting to reconnect, but try 3 times. I wait 90 seconds before trying to copy the log file, which should give me enough time to capture the retries.
  • There was a cascade of query failures because I was searching for the testbed id with the short domain name instead of the FQDN used in the testbeds table. This value was given to me by the gatherer; I've since modified the gatherer to use the FQDN. All the other queries depended on this number, so the error propagated down.
  • This error, however, demonstrated a specific flaw in how I handled empty query results. I indicate failed queries by returning nil instead of an array. The only reason I caught the error was that I tried to flatten the nil return. I've updated this to raise an exception if a nil query occurs for any of the members of the Identify class. Not being able to identify the node should be a fatal error. This exception is unhandled, so it will propagate up to the main block, where it gets logged and terminates the script.
  • Also added a little bit of logging to gatherer, but not much. I should really fix its error handling.
  • Made Check_sql.check_in a public method and had it called last in the MAIN.
  • Noticed that sometimes I get an xml create object error. I'll have to figure out why that's happening. It's probably due to gatherer not completing properly, but now I should be able to find the nodes where it happens.
  • Trying to stick to the logging/exception raising convention of setting the error text to: "class.method - error"
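The nil-result rule and the message convention can be sketched as one guard. `IdentifyError` and `require_result` are hypothetical names; the idea is that Identify lookups pass their query results through it, and the unhandled exception reaches the main block, which logs `e.message` and exits.

```ruby
class IdentifyError < StandardError; end

# Treat nil (or an empty result) from an identification query as fatal,
# using the "class.method - error" message convention.
def require_result(result, where, what)
  empty = result.nil? || (result.respond_to?(:empty?) && result.empty?)
  raise IdentifyError, "#{where} - no result for #{what}" if empty
  result
end
```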

6/25/2010

We're going live! I'll actively update inventory52.

Few minor tweaks:

  • Edited logcopy to wait only 30 seconds before unmounting
  • set default to be inventory52, since this is now going to be our main table.

Since I now check in last and call it externally, I should include the number of changed lines in my checkin. It would be helpful for diagnostics.

6/29/2010

After some thought, I realized that writer should call logcopy as its last action. This ensures that logcopy copies a complete file. It avoids a timing problem where the inventory script would have to guess a reasonable time for writer to complete. Logcopy is guaranteed a properly closed file, since writer controls when logcopy is called. I could have put this in an at_exit block, but I just left it in the MAIN ensure block. I used the system call:

 system("/usr/bin/ruby /root/logcopy.rb")

Note the explicit paths. The OMF exec subshells don't understand relative paths. I could have used exec, but it replaces the current process with the one being exec'd. While this could have worked, it would have terminated writer prematurely without closing out all the objects. That probably shouldn't matter, but it's not neat. While checking on the execution I noted that at the point where writer invokes logcopy,

The 3 main competing strategies are system, exec, and %x[], where the last one is very similar to the backtick command. I guess there are also popen and popen3. System is good enough for these purposes, since I only care about its execution, not its output.
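The difference between the options can be sketched briefly. `RbConfig.ruby` (the absolute path of the running interpreter) stands in for /usr/bin/ruby, and `finish` is a hypothetical helper showing the ensure-block idea: close the logger first, then run the copier. The array form of `system` skips the shell; exec is only described, since it would replace the current process and nothing after it would run.

```ruby
require 'logger'
require 'rbconfig'
require 'shellwords'

RUBY = RbConfig.ruby   # absolute path, like /usr/bin/ruby

# system: output is not captured; returns true if the command exited 0.
ok = system(RUBY, "-e", "exit 0")

# %x[] (like backticks): runs through the shell and returns captured stdout.
out = %x[#{Shellwords.escape(RUBY)} -e "print 'hi'"]

# Close the log before handing it to the copier, so it sees a complete file.
def finish(logger, cmd)
  logger.close
  system(*cmd)
end
```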

I've created a new inventory image and inventory script to reflect this change: james_inv_2.ndz and inventory2-2.rb are a testing pair. They'll replace the last known good pair: james_inv.ndz and inventory2-1.rb

TODO Have logcopy do sanity check for files and replace them

11/16/2010

gatherer version 2 is now in the works. It's essentially a complete rewrite where we standardise on lshw as the method of getting hardware information. Instead of doing regexps on lspci, lsusb and the like, we will use lshw. There may be some issues with usb ethernet devices (depending on how lshw classifies them vs how the database treats them). That'll have to be worked out.

I've revamped the argument processing (using a standard library) so that it's much cleaner. This is done in the main function body. I'm also using the logger more liberally, since this is expected to be a standard. By default logs go to STDOUT, but that can be changed with a flag. It would be nice to make the output file path a mandatory argument, but I've not figured out how to do that yet.

The new model is much simpler since we only need lshw. There is a parent class called Component that defines a poll method. Each component should be able to poll for its data; however, the flag (defining what information to get from lshw) and the regexp are uninitialized arguments (no defaults, so there is loud complaining if I forget one, by design). I use popen3 to call lshw, which suppresses any extraneous output lshw makes. I also scan stderr and raise an exception if lshw is not found (the lshw location is a configurable parameter). Component should not be instantiable; instead all other data classes should be derived from it, and then provide named accessors to the derived components.
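A sketch of the lshw call with a clear missing-binary error. It uses `Open3.capture3`, a convenience wrapper in the same library as popen3, which collects stdout and stderr separately; spawning a command that doesn't exist raises `Errno::ENOENT`, which is turned into a named error. `run_lshw`, `LshwNotFound` and the `lshw:` keyword are hypothetical, and `-c` is the class flag the gatherer already uses.

```ruby
require 'open3'

class LshwNotFound < StandardError; end

# Run lshw for one device class and return stripped stdout lines.
# stderr is captured (so it stays off our output) and reported on failure.
def run_lshw(klass, lshw: "lshw")
  out, err, status = Open3.capture3(lshw, "-c", klass)
  raise LshwNotFound, "Component.run_lshw - lshw failed: #{err.strip}" unless status.success?
  out.lines.map(&:strip)
rescue Errno::ENOENT
  raise LshwNotFound, "Component.run_lshw - #{lshw} not found"
end
```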

11/19/2010

After trying to build the network class I realized that the common set of things that need to be done by each child does NOT involve searching, meaning that each child will have to "scrape" the lines a little differently. However, each child does need to run lshw, and should require it to spit out an array of lines or a "folded" array of lines where each fold occurs at a marker; typically the marker is *-, which is the default method of output. In version 2.08 onward I've replaced the collection of functions in Component (the parent) with a single function, lshw_arr, which takes a flag (for the -c argument) and an optional marker. If the marker is specified I use Array.search and Array.map to fold the array; otherwise I just return it straight. All elements of the array are .strip(ed) and .flatten.compact(ed) to ensure sane return values. I don't want to pass around nested arrays. If something needs nesting I'll make it a new object, but return values should have at most 1 level of nesting.
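The folding step can also be expressed with `Enumerable#slice_before`, which starts a new group at every element matching the block. This is an alternative sketch, not lshw_arr itself; `fold_lines` is a hypothetical name, and anything before the first marker (such as a header line) is dropped.

```ruby
# Fold lshw output into one flat array per device, splitting at lines
# that begin with the marker ("*-" by default). Lines are stripped and
# blank lines removed, so each group has exactly one level of nesting.
def fold_lines(lines, marker = "*-")
  clean = lines.map(&:strip).reject(&:empty?)
  clean.slice_before { |l| l.start_with?(marker) }
       .select { |group| group.first.start_with?(marker) }
end
```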

The lshw webpage is http://ezix.org/project/wiki/HardwareLiSter. This page has the device class listings and other documentation.

12/8/2010

Fixed a few bugs involving how the array slices were processed. I ran into a problem where the device count would get messed up if lshw returned non-unique values. I solved it by reversing the array before slicing (documented in the code).

1/8/2011

I've added the sql_query method to the main component class. All child classes will query sql the same way. The connect and disconnect methods are also Component class methods. I thought I had previously added a discussion about various connection models. The two competing ideas were:

  • Let each child handle their own connections
  • Let the main program handle the connection.

I decided to put the connection task in the main function because fewer connection attempts would be made (this should be more stable) and I can use a single begin/rescue/ensure block to make sure the connection closes.

I've added an abstract method called update to the component (in the parent it raises a NotImplementedError). This means every child has to provide its own update method; calling the inherited one raises.
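A minimal sketch of the abstract parent, covering both rules: Component itself refuses to be instantiated, and `update` raises until a child overrides it. The `Network` child is hypothetical. Note that `NotImplementedError` is a ScriptError, not a StandardError, so a bare `rescue` will not swallow it.

```ruby
class Component
  def initialize
    raise NotImplementedError, "Component is abstract" if instance_of?(Component)
  end

  # Every child must override this.
  def update
    raise NotImplementedError, "#{self.class}.update - not implemented"
  end
end

# Hypothetical child used only to show the pattern.
class Network < Component
  def update
    :updated
  end
end
```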

I used an interesting construction in the sql_query method that's worth mentioning. Given an array of parameters to glue together, I can string them without a trailing separator like so: for a = [w, x, y, z], a.first(a.length-1).map{|e| e + ","}.join + a.last. This gets the elements as a string with separators. (Array#join also takes a separator argument, a.join(","), which is cleaner. It doesn't fail on nil elements, it renders them as empty strings, while the map version raises on a nil, so nils need handling either way.)
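With `join`, the separator only goes between elements, so there is nothing to trim. This sketch maps nil to NULL before joining, since a bare empty string is usually not what an SQL statement wants; `sql_values` is a hypothetical helper, and values are not escaped here (real code should use the mysql driver's escaping).

```ruby
# Quote each value for SQL, turn nil into NULL, and join with ", ".
def sql_values(values)
  values.map { |v| v.nil? ? "NULL" : "'#{v}'" }.join(", ")
end
```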

I also started using Array#zip to iterate over two arrays with map. If I have arrays A and B and need to do something to both, I can do C = A.zip(B).map{|x| f(x[0],x[1])}. I'm doing this with the individual data elements that need to be pushed back into the tables; that way I can just zip the individual data arrays into one big one later. See the kind construct.

Finally, the sql_query method now has a .flatten in the last statement, since the row and query operations return a nested array. I've made it a policy that a query should return a flat array of answers.

1/12/2011

I've made a fundamental mistake in my assumptions. The update method should not just insert things that are missing; it should also check that the data matches what is currently in the database.

I'll need insert and update methods in component.

1/14/2011

The component sql_insert and sql_update methods have been added. I might reconsider how I pass parameters to the latter, since I have to fold things into a hash before passing them up; it might be cleaner to use two arrays and zip them rather than passing a hash. Now the update method needs to be redone.
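The two-array alternative could look like this. It is a sketch, not the component's sql_update: the column names, the `where` string and the helper signature are hypothetical, and values are interpolated without escaping.

```ruby
# Build an UPDATE from parallel arrays of columns and values by zipping
# them into "col = 'val'" pairs (nil becomes NULL).
def sql_update(table, columns, values, where)
  raise ArgumentError, "sql_update - columns/values length mismatch" unless columns.size == values.size
  sets = columns.zip(values).map { |col, val| val.nil? ? "#{col} = NULL" : "#{col} = '#{val}'" }
  "UPDATE #{table} SET #{sets.join(', ')} WHERE #{where}"
end
```

The length check is the one thing a hash gives for free, so it is worth keeping if parameters are passed as two arrays.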
